M249 Practical modern 
statistics 


The Open 
University 




Computer Book 3 


Multivariate analysis 


About this module 


M249 Practical modern statistics uses the software packages IBM SPSS Statistics (SPSS Inc.) 
and WinBUGS, and other software. This software is provided as part of the module, and its 
use is covered in the Introduction to statistical modelling and in the four computer books 
associated with Books 1 to 4. This computer book contains all the computer work associated 
with Book 3. 


Cover image courtesy of NASA. This photograph, acquired by the ASTER instrument on 
NASA's Terra satellite, shows an aerial view of a large alluvial fan between the Kunlun and 
Altun mountains in China's Xinjiang province. For more information, see NASA's Earth 
Observatory website at http://earthobservatory.nasa.gov. 


This publication forms part of an Open University module. Details of this and other 
Open University modules can be obtained from the Student Registration and Enquiry 
Service, The Open University, PO Box 197, Milton Keynes MK7 6BJ, United Kingdom 
(tel. +44 (0)845 300 60 90; email general-enquiries@open.ac.uk). 


Alternatively, you may visit the Open University website at www.open.ac.uk where 
you can learn more about the wide range of modules and packs offered at all levels by 
The Open University. 


To purchase a selection of Open University materials visit www.ouw.co.uk, or contact 
Open University Worldwide, Walton Hall, Milton Keynes MK7 6AA, United Kingdom 
for a brochure (tel. +44 (0)1908 858779; fax +44 (0)1908 858787; email 
ouw-customer-services@open.ac.uk). 


The Open University, Walton Hall, Milton Keynes MK7 6AA. 
First published 2007. Second edition 2013. 
Copyright © 2007, 2013 The Open University 


All rights reserved. No part of this publication may be reproduced, stored in a 
retrieval system, transmitted or utilised in any form or by any means, electronic, 
mechanical, photocopying, recording or otherwise, without written permission from 
the publisher or a licence from the Copyright Licensing Agency Ltd. Details of such 
licences (for reprographic reproduction) may be obtained from the Copyright 
Licensing Agency Ltd, Saffron House, 6-10 Kirby Street, London EC1N 8TS
(website www.cla.co.uk). 


Open University materials may also be made available in electronic formats for use
by students of the University. All rights, including copyright and related rights and
database rights, in electronic materials and their contents are owned by or licensed
to The Open University, or otherwise used by The Open University as permitted by
applicable law.


In using electronic materials and their contents you agree that your use will be solely 
for the purposes of following an Open University course of study or otherwise as 
licensed by The Open University or its assigns. 

Except as permitted above you undertake not to copy, store in any medium 
(including electronic storage or use in a website), distribute, transmit or retransmit, 
broadcast, modify or show in public such electronic materials in whole or in part 
without the prior written consent of The Open University or in accordance with the 
Copyright, Designs and Patents Act 1988. 


Edited, designed and typeset by The Open University, using the Open University 
TEX System. 
Printed in the United Kingdom by The Charlesworth Group, Wakefield. 


ISBN 978 1 7800 7663 8 
2.1 


Contents 


Chapter 1 Scatterplots and profile plots 
1.1 Customizing scatterplots 
1.2 Matrix scatterplots 
1.3 Profile plots 


Chapter 2 Numerical summaries and standardization 
Chapter 3 Principal component analysis 


Chapter 4 Extracting and plotting principal components


Chapter 5 Canonical discriminant analysis 
Chapter 6 Allocation 

Computer Exercises on Book 3 

Learning outcomes 

Solutions to Computer Activities 

Solutions to Computer Exercises 


Index 




Chapter 1 
Scatterplots and profile plots 


In the Introduction to statistical modelling, you learned how to obtain a scatterplot
in SPSS. In Section 1.1, you will learn how to use different plotting symbols to
identify groups, and how to label individual points on a scatterplot. You will learn
how to obtain a matrix scatterplot in Section 1.2 and a profile plot in Section 1.3.


1.1 Customizing scatterplots 


Computer Activity 1.1 Identifying groups on a scatterplot 


Data on the average house price and the average household income in 353 local 
authorities in England are described in Activity 2.1 of Book 3. In this activity, you 
will obtain a scatterplot of the data using different colours and plotting symbols 
for the different regions. You will also learn how to label individual points. 


The data are in the file housing.sav. Open this file now. There are four 
variables: region, the region in which the authority is located; authority, the 
name of the authority; houseprice, the average price of a 4/5 room house; and 
income, the average household income for working households with one member 
aged between 20 and 39. 


Obtain a scatterplot of houseprice and income with data for authorities in 
different regions plotted using different colours, as follows. 

• Obtain the Scatter/Dot dialogue box.
• Select Simple Scatter and click on Define. The Simple Scatterplot
dialogue box will open.
• Enter houseprice in the Y Axis field and income in the X Axis field.
• Enter region in the Set Markers by field. This variable will be used to
identify groups in the data.
• Enter authority in the Label Cases by field. This variable will be used to
label individual points on the scatterplot.
• Click on OK.


A scatterplot similar to that shown in Figure 1.1 will be displayed in the Viewer 
window. 
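
The effect of the Set Markers by field can be pictured as splitting the cases by a
grouping variable and giving each group its own plotting symbol. The following Python
sketch illustrates that idea outside SPSS. The rows are made-up placeholder values, not
figures from housing.sav, and the function names and marker names are hypothetical.

```python
from collections import defaultdict
from itertools import cycle

# Hypothetical rows (region, authority, houseprice, income); NOT taken from housing.sav.
rows = [
    ("London", "Authority A", 300000, 50000),
    ("London", "Authority B", 250000, 45000),
    ("North West", "Authority C", 90000, 30000),
    ("South East", "Authority D", 150000, 38000),
]


def group_points(rows):
    """Split the cases by region, keeping (income, houseprice, label) for each point."""
    groups = defaultdict(list)
    for region, authority, houseprice, income in rows:
        groups[region].append((income, houseprice, authority))
    return dict(groups)


def assign_markers(groups, markers=("triangle", "circle", "square", "cross")):
    """Give each group a plotting symbol, reusing symbols if groups outnumber them."""
    return dict(zip(sorted(groups), cycle(markers)))


groups = group_points(rows)
markers = assign_markers(groups)
for region in sorted(groups):
    print(region, markers[region], groups[region])
```

Each group keeps the authority name alongside its coordinates, which is the information
needed to label an individual point later.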


All the data files for this 
computer book are located in 
the Book 3 subfolder of the 
M249 Data Files folder. 


Graphs > Legacy Dialogs >
Scatter/Dot...




[Figure: scatterplot of houseprice (vertical axis) against income (horizontal axis, 0 to 100000), with a legend identifying the marker used for each region]

Figure 1.1  Scatterplot of average house price and average household income


Points corresponding to different regions will be plotted using different colours on 
your computer screen (but not in Figure 1.1). If you need to print a graph in 
black and white, it may be better to use different plotting symbols to identify 
groups in the data than to use different colours. Choose a different plotting 
symbol for local authorities in the London region, as follows. 


• Place the mouse pointer on the scatterplot and double-click to open the
Chart Editor.
• Click once on any point on the scatterplot to select (all) the points.
• Identify a point corresponding to a local authority in the London region and
click on it once. Only the points corresponding to local authorities in the
London region will now be selected.
• Double-click on a point corresponding to an authority in the London region
to open the Properties dialogue box.
• If necessary, click on the Marker tab to bring the Marker panel uppermost.
• In the Marker area, click on the Type down arrow and select a triangle
from the plotting symbols displayed.

Now change the colour of the triangles, as follows.

• In the Color area, click on Fill, then click on the black rectangle in the
colour palette.
• In the Color area, click on Border, then click on the black rectangle in the
colour palette.
• Click on Apply, then on Close to close the Properties dialogue box.
• Close the Chart Editor.


The points on the scatterplot corresponding to authorities in the London region 
will be represented as black triangles with black borders. You can change the 
plotting symbols for other regions using this method if you wish. In Figure 1.2, 
the points representing authorities in all regions other than London are plotted 
using circles with black borders. 







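[Figure: scatterplot of houseprice (vertical axis, 0 to 600000) against income (horizontal axis, 0 to 100000), with the London authorities plotted as black triangles and the other regions as circles]

Figure 1.2 Scatterplot with edited plotting symbols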


Now label some of the unusual points with the names of the local authorities to 
which they correspond, as follows. 


• Double-click on the scatterplot to open the Chart Editor.
• Identify the gun sight icon in the Chart Editor toolbar: this is the icon
with the two squares and crosshairs. (When the mouse pointer is placed on
this icon, the label Data Label Mode appears.)
• Click on the gun sight icon.
• Move the mouse pointer over the scatterplot; the pointer will change to a gun
sight.
• Place the gun sight over the point in the top right-hand corner of the plot,
and click. The label Kensington and Chelsea will appear near to the point.
• Label a few other unusual points with the names of the local authorities to
which they correspond. (Note that you can remove a label by placing the gun
sight over the point and clicking.)
• Close the Chart Editor.


A scatterplot of the data with three of the points labelled is shown in Figure 1.3. 







[Figure: the same scatterplot of houseprice against income, with the points for Kensington and Chelsea, Hammersmith and Fulham, and City of London labelled]

Figure 1.3 Scatterplot with unusual points labelled


Computer Activity 1.2 Crime in the USA 


The SPSS data file crimeloss1.sav contains data on loss from crime and
expenditure on policing in the 50 states of the USA in 1966. Open this file. There 
are four variables: state, region (the region of the USA in which the state is 
located), totexp (the total per capita expenditure on policing), and totloss (the 
total per capita cost of crime). All costs are in US dollars. 


(a) Obtain a scatterplot with totexp on the y-axis and totloss on the x-axis.
Edit the scatterplot so that states in different regions are represented using 
black plotting symbols of different shapes. On the scatterplot, label the state 
with the highest expenditure on policing and the state with the lowest 
expenditure on policing. 

(b) Describe the overall relationship between total loss due to crime and total 
expenditure on policing. How does this relationship differ between the South 
and the West regions? 


1.2 Matrix scatterplots 


In this section, instructions are given for obtaining a matrix scatterplot. This is 
done using Scatter/Dot... from the Legacy Dialogs submenu of the Graphs 
menu. 


Computer Activity 1.3 Creating a matrix scatterplot 


Data on the performance of eleven-year-old primary school students in 150 local 
education authorities in English, Mathematics and Science are described in 
Example 1.3 of Book 3. The data are in the file lea.sav. Open the file now. 
Obtain a matrix scatterplot of the three variables english, maths and science, 
as follows. 


• Obtain the Scatter/Dot dialogue box.
• Select Matrix Scatter by clicking on it, and click on Define. The
Scatterplot Matrix dialogue box will open.
• Enter the three variables english, maths and science in the Matrix
Variables field. The order in which you enter the variables dictates the
order in which the variables are presented in the resulting plot, but otherwise
does not matter.
• Enter the variable lea in the Label Cases by field. (This makes it possible
to identify individual data points.)
• Click on OK.

Graphs > Legacy Dialogs >
Scatter/Dot...


The matrix scatterplot that will be displayed in the Viewer window is shown in
Figure 1.4(a).

[Figure: two matrix scatterplots of english, maths and science. Panel (a) is the plot as first produced; in panel (b) the point for the Isles of Scilly is labelled.]

Figure 1.4 Matrix scatterplots of LEA performance data


It is sometimes useful to label unusual points, such as potential outliers. Do this 
now, as follows. 


• Double-click on the matrix scatterplot to open the Chart Editor.
• In the toolbar of the Chart Editor, click on the gun sight icon. (When
moved over the scatterplot the mouse pointer will now resemble a gun sight.)
• In the maths/english panel identify the outlier corresponding to an
authority with a high Mathematics score but a low English score, and click
on it.
• Close the Chart Editor.


The matrix scatterplot will be as shown in Figure 1.4(b). Notice that a label 
identifying the point has been added to the scatterplot; in this case, the point 
corresponds to the Isles of Scilly. This label is included not only on the 
maths/english panel but on the other panels as well. 


Unless a point is well separated from the rest it is not always entirely clear to 
which point a label is attached. Thus labelling points in this way is useful 
primarily for identifying points that are well separated from the rest. 


Computer Activity 1.4 Cost of crime and policing 


In Computer Activity 1.2, you obtained a scatterplot of total expenditure on 
policing against total cost of crime in the 50 US states using the data in the file 
crimeloss1.sav. In the data file crimeloss2.sav, the variable totexp, the total 
per capita expenditure on policing, has been replaced by two variables — 
stateexp, the per capita state expenditure on policing, and localexp, the per 
capita local expenditure on policing. Also, the variable totloss has been replaced 
by two variables — proploss, the per capita property loss due to crime, and 
persloss, the per capita personal loss due to crime. Open the file now. 







(a) Obtain a matrix scatterplot of the variables proploss, persloss, stateexp 
and localexp. Specify the variable state to label any unusual points to be 
identified later. 


(b) Describe the relationships between state and local expenditure on policing, 
and between property and personal loss. 


(c) Consider the panel relating to local expenditure on policing and property 
loss. Identify an unusual point on this scatterplot, and label it. Explain in 
what way the point is unusual. 


(d) Does the point you identified in part (c) appear to be unusual on any other 
panel? If so, explain in what way. 


1.3 Profile plots 


To draw a profile plot in SPSS, the data file must be transformed so that the 
variables are in the rows and the observations are in the columns. This process is 
called transposing the data. Then the Line Charts dialogue box is used to 
obtain a multiple line plot of the transposed data. The method is illustrated in 
Computer Activity 1.5. 


Computer Activity 1.5 Creating profile plots 


Data on the uptake of trace elements by the plant Echinacea purpurea are 
described in Example 1.4 of Book 3, and a profile plot of these data is presented in 
Example 2.5. In this activity, you will obtain such a profile plot. 


The data are in the file echinacea.sav. Open the file now. In this data file, the 
logarithms of the concentrations of the different elements are in the columns. 
These are the variables: lCu, lFe, and so on. The observations, which correspond
to the different parts of the plants sampled, are in the rows. These data were all 
collected in the autumn. 


First transpose the data, as follows.

• Choose Transpose... from the Data menu. The Transpose dialogue box
will open.
• Enter the variables lCu, lFe, lMn, lZn, lNi, lLi, lSr, lMg and lCa in the
Variable(s) field. These variables will form the rows of the transformed data
set.
• Enter sample in the Name Variable field. This ensures that the columns in
the transformed data table are appropriately labelled.
• Click on OK.
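
Transposing simply swaps the roles of rows and columns. The following Python sketch
shows the operation on a small made-up table. The numbers are placeholders rather than
values from echinacea.sav, and only three of the variables and three of the samples are
included to keep the example short.

```python
# Hypothetical values; the real data are in echinacea.sav.
variables = ["lCu", "lFe", "lMn"]      # columns before transposing
samples = ["root", "stem", "leaf"]     # rows before transposing
data = [
    [1.0, 2.0, 3.0],   # root
    [4.0, 5.0, 6.0],   # stem
    [7.0, 8.0, 9.0],   # leaf
]


def transpose(table):
    """Swap rows and columns of a rectangular list of lists."""
    return [list(column) for column in zip(*table)]


transposed = transpose(data)
# After transposing, each row is an element and each column is a sample.
for name, row in zip(variables, transposed):
    print(name, dict(zip(samples, row)))
```

Transposing twice returns the original table, which is a quick way to check that no
values have been lost.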

Look at the transposed data in the Data View panel of the Data Editor. There
are now nine cases (each representing one of the chemical elements measured) and
five variables (one for each part of the plant studied), together with a variable
called CASE_LBL which contains labels for each row. Now obtain the profile plot,
which is a multiple line plot of the transposed data, as follows.

• Obtain the Line Charts dialogue box.
• Select Multiple.
• In the Data in Chart Are area, select Values of individual cases.
• Click on Define. The Define Multiple Line: Values of Individual
Cases dialogue box will open.
• Enter the variables root, stem, leaf, flower and herbs in the Lines
Represent field.
• In the Category Labels area, select Variable and enter the variable
CASE_LBL in its field.
• Click on OK.


The letter l in the variable
names is for log.


Graphs > Legacy Dialogs > 
Line... 




The profile plot that will be displayed in the Viewer window will be similar to 
that shown in Figure 1.5. 





[Figure: profile plot with one line for each of root, stem, leaf, flower and herbs; vertical axis Value (0.00 to 12.00), horizontal axis CASE_LBL with categories lCu, lFe, lMn, lZn, lNi, lLi, lSr, lMg and lCa]

Figure 1.5 Profile plot of the Echinacea data (autumn samples)


To obtain the plot in Figure 1.5 the lines were altered so that they can be 
distinguished more easily when printed. Change the appearance of the lines now, 
as follows. 


• Open the Chart Editor by double-clicking on the profile plot.
• Within the Chart Editor, click on any line to highlight (all) the lines.
• Double-click on any line to open the Properties dialogue box.
• If necessary, click on the Lines tab to bring the Lines panel uppermost.
• Click on the line you wish to change. This line will then be highlighted in the
Chart Editor.
• Select the style and colour you prefer for the line, and click on Apply.
• Repeat the last two steps for each line you wish to alter.
• Click on Close to close the Properties dialogue box.
• Close the Chart Editor.


The categories can be rearranged on the horizontal axis, so that the elements are 
placed (roughly) in order of decreasing (or increasing) concentrations. This may 
make it easier to describe the plot. Do this now, as follows. 


• Open the Chart Editor by double-clicking on the profile plot.
• Click on the large X in the Chart Editor toolbar. The Properties dialogue
box will open.
• Bring the Categories panel uppermost by clicking on the Categories tab.
• In the Order field, click on the variable lFe to highlight it, then move it one
place down the list by clicking on the down arrow to the right of the field.
• Move lFe until it is third from bottom of the list.
• Click on Apply.
• Re-order the categories so that the concentrations increase from left to right.
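
One way to decide on a suitable order is to rank the categories by a summary of their
values across the lines. The following Python sketch ranks elements by their mean log
concentration. The numbers are made-up placeholders, not values from echinacea.sav, and
using the mean as the ranking criterion is an assumption made for illustration; ordering
by eye from the plot, as described above, is equally valid.

```python
from statistics import mean

# Hypothetical log concentrations for each element across plant parts;
# NOT the values in echinacea.sav.
profiles = {
    "lCu": [2.1, 1.9, 2.4],
    "lFe": [5.0, 4.2, 5.5],
    "lMg": [8.1, 7.9, 8.6],
    "lLi": [0.2, 0.1, 0.4],
}


def order_by_mean(profiles, descending=False):
    """Return category names sorted by their mean value across the plotted lines."""
    return sorted(profiles, key=lambda name: mean(profiles[name]), reverse=descending)


print(order_by_mean(profiles))                   # increasing concentration
print(order_by_mean(profiles, descending=True))  # decreasing concentration
```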




Alternatively, choose 
Properties from the Edit menu 
of the Chart Editor. 


Variables can be moved up the 
list by using the up arrow. 




The plot can also be edited to display more meaningful axis labels. Figure 1.6
shows the profile plot, suitably customized.

Editing the axis labels is described in the Introduction to
statistical modelling.





[Figure: customized profile plot; vertical axis log concentrations, horizontal axis element]

Figure 1.6 Customized profile plot of the Echinacea data


Computer Activity 1.6 Iberian hams 


This activity will give you some practice at obtaining and customizing a profile
plot. It is based on data relating to the design of an artificial olfactory system to
test Iberian hams. Fifteen different sensors were used to detect the volatile
compounds emanating from samples of ham, using a technique called static
headspace generation. The experiment was repeated in eight different settings:
two temperatures (30°C and 50°C), two durations (20 minutes and 40 minutes),
and two amounts of ham (2 g and 5 g). The aim of the experiment was to identify
which setting produced the best overall responses from the fifteen sensors.

Garcia, M. et al. (2003) Artificial olfactory system for the classification of Iberian
hams. Sensors and Actuators B, 96, 621-629.


The data are in the file hams.sav. Open the file now. There are sixteen
variables. The variable setting lists the eight settings used, and the variables s1 
to s15 give the responses of the fifteen sensors. These responses have been 
normalized, that is, they have been transformed so that the maximum value 
recorded at each sensor is 1. 
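
The normalization has already been applied to the data in hams.sav, so you do not need
to carry it out. For illustration only, the following Python sketch shows one
transformation with the stated property: dividing each response by the largest response
recorded at the same sensor. This is an assumption about how the normalization was done,
and the raw values are made-up placeholders with positive responses.

```python
# Hypothetical raw responses: one row per setting, one column per sensor.
raw = [
    [0.5, 2.0, 10.0],
    [1.0, 4.0, 5.0],
    [0.25, 1.0, 20.0],
]


def normalize_by_column_max(table):
    """Divide every value by the largest value in its column (assumes positive maxima)."""
    maxima = [max(column) for column in zip(*table)]
    return [[value / m for value, m in zip(row, maxima)] for row in table]


normalized = normalize_by_column_max(raw)
for row in normalized:
    print(row)
```

After this transformation the largest value in every column is 1, so responses from
sensors with different measurement ranges can be compared on a common scale.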


(a) What constitutes the observations in this data set? 
(b) Obtain a profile plot of the normalized responses for the eight settings. 


(c) Identify the two settings that produce the best normalized responses. Explain 
your choice. 


Summary of Chapter 1 


In this chapter, you have learned how to represent groups on a scatterplot, how to 
label points, and how to obtain a matrix scatterplot. Transposing data, obtaining 
a profile plot and customizing a plot have also been described. 




Chapter 2 
Numerical summaries and standardization 


In this chapter, you will learn how to obtain numerical summaries of a
multivariate data set, and how to standardize variables in SPSS. There are several
ways of obtaining mean vectors in SPSS: for example, using Descriptives...
from the Descriptive Statistics submenu of Analyze, or using Means...
from the Compare Means submenu of Analyze. Variables can be standardized
and saved using Descriptives.... Covariance matrices and correlation matrices
are found using Bivariate... from the Correlate submenu of Analyze. The
calculation of mean vectors is illustrated in Computer Activity 2.1.


Computer Activity 2.1 Obtaining mean vectors 


In Computer Activity 1.1, you obtained a scatterplot of data on average house 
price and average household income in 353 local authorities in England, using 
different plotting symbols for authorities in different regions. In this activity, you 
will calculate the mean vector for house price and household income. You will also 
obtain the mean vectors for the regions. 


The data are in the file housing.sav. Open the file now. The file contains four 
variables: region, the region in which the authority is located; authority, the 
local authority area; houseprice, the average house price within the local 
authority area; and income, the average household income. Obtain the mean 
vector for houseprice and income, as follows. 


• Choose Descriptives... from the Descriptive Statistics submenu of
Analyze. The Descriptives dialogue box will open.
• Enter the variables houseprice and income in the Variable(s) field.
• Click on OK.

The following table will be displayed in the Viewer window.


Descriptive Statistics

                      N      Mean    Std. Deviation
houseprice            353    …       81563.273
income                353    …       8990.826
Valid N (listwise)    353


Amongst other things, this table gives the means for the variables houseprice 
and income based on all of the data. Thus the mean vector for the data overall, 
rounded to the nearest pound, is (121054, 33427).


The mean vector for each region can be found using Explore... (Analyze > Descriptive
Statistics > Explore...) as described in the Introduction to statistical modelling.
An alternative is to use Means... from the Compare Means submenu of Analyze.
Do this now, as follows.


• Choose Means... from the Compare Means submenu of Analyze. The
Means dialogue box will open.
• Enter houseprice and income in the Dependent List field.
• Enter region in the Independent List field.
• Click on OK.




The means of houseprice and income for each of the nine English regions will be
given in the Report output table in the Viewer window. Thus, for example, the 
mean vector for the East Midlands region is (90586, 29843). The means for the 
data as a whole (all regions together) are also given at the bottom of the table. 
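
A mean vector is just the list of variable means, and a mean vector for a group uses
only the cases in that group. The following Python sketch computes both for a few
made-up cases. The values are placeholders, not figures from housing.sav, and the
function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical cases (region, houseprice, income); NOT taken from housing.sav.
cases = [
    ("East Midlands", 90000, 29000),
    ("East Midlands", 92000, 31000),
    ("London", 250000, 45000),
    ("London", 350000, 55000),
]


def mean_vector(rows):
    """Mean of each variable, returned as a tuple (houseprice, income)."""
    return tuple(mean(column) for column in zip(*rows))


def mean_vectors_by_group(cases):
    """Mean vector for each region, plus the overall mean vector under the key 'Total'."""
    groups = defaultdict(list)
    for region, *values in cases:
        groups[region].append(values)
    result = {region: mean_vector(rows) for region, rows in groups.items()}
    result["Total"] = mean_vector([values for _, *values in cases])
    return result


for region, vector in mean_vectors_by_group(cases).items():
    print(region, vector)
```

The 'Total' entry plays the same role as the final row of the Report table, which gives
the means for all regions together.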


You will need the file housing.sav in Computer Activity 2.2, so do not close it. 


The Descriptive Statistics output table in Computer Activity 2.1 also 
includes standard deviations. Together with the means, these can be used to 
standardize the variables. However, it is possible to obtain standardized variables 
directly in SPSS using the Descriptives dialogue box. This is illustrated in 
Computer Activity 2.2. 


Computer Activity 2.2 Standardizing variables 


Open the file housing.sav if it is not still open. Obtain standardized versions of 
the variables houseprice and income, as follows. 


• Obtain the Descriptives dialogue box.
• Enter the variables houseprice and income in the Variable(s) field.
• Check Save standardized values as variables (by clicking on it).
• Click on OK.


Look at the Data View panel of the Data Editor. Notice that two new 
variables named Zhouseprice and Zincome have been added to the data file. 
These are the standardized versions of houseprice and income. Now look at the
Variable View panel. Notice that SPSS has included labels for Zhouseprice 
and Zincome. Delete these labels to reduce clutter. The variables Zhouseprice 
and Zincome have mean 0 and standard deviation 1. Check this now, as follows. 


• Obtain the Descriptives dialogue box.
• Remove the variables houseprice and income from the Variable(s) field,
and enter the variables Zhouseprice and Zincome.
• Deselect Save standardized values as variables by clicking on it. (If you
do not do this, SPSS will calculate further standardized variables, named
ZZhouseprice and ZZincome.)
• Click on OK.


The Descriptive Statistics output table confirms that Zhouseprice and 
Zincome have mean 0 and standard deviation 1. 
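
Standardizing a variable means subtracting its mean from each value and dividing by its
standard deviation. The following Python sketch carries out the calculation and the same
check on a few made-up values. It assumes the sample standard deviation (divisor n - 1),
and the numbers are placeholders, not figures from housing.sav.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation (divisor n - 1)."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical house prices; NOT taken from housing.sav.
houseprice = [90000.0, 120000.0, 150000.0, 300000.0]
z = standardize(houseprice)

# The standardized values should have mean 0 and standard deviation 1,
# apart from floating-point rounding.
print(round(mean(z), 10), round(stdev(z), 10))
```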


For multivariate data, calculating variances or standard deviations is seldom 
sufficient. The whole covariance matrix is usually required. This is obtained using 
Bivariate... from the Correlate submenu of Analyze. The method is 
illustrated in Computer Activity 2.3. 


Computer Activity 2.3 Covariance and correlation matrices 


In this activity, you will obtain the covariance matrix and the correlation matrix 
for the mathematical ability data set which is described in Activity 2.3 of Book 3. 
The data are in the file mathsability.sav. Open the file now. There are six 
variables. The variable boy identifies each boy. The variable form indicates which 
form the boy was in at the time he sat the exams. The variable age gives the age 
(in years) of the boy at the time of the exams. The three variables geometry, 
arithmetic and algebra give the scores on each of the three exams. 


Analyze > Descriptive
Statistics > Descriptives...




In SPSS, the correlation matrix is obtained more directly than the covariance
matrix. Thus, in this activity, you will obtain the correlation matrix for the three
exam score variables and age before obtaining the covariance matrix.

• Obtain the Bivariate Correlations dialogue box.
• Enter the four variables age, geometry, arithmetic and algebra in the
Variables field.
• In the Correlation Coefficients area, make sure that Pearson is selected.
• In the Test of Significance area, make sure that Two-tailed is selected.
• Also make sure that Flag significant correlations is checked.
• Click on OK.

Analyze > Correlate >
Bivariate...

The following output table will be displayed in the Viewer window.


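Correlations

                                     age     geometry   arithmetic   algebra
age          Pearson Correlation     1       .099       …            .169
             Sig. (2-tailed)                 .371       …            .128
             N                       83      83         83           83
geometry     Pearson Correlation     .099    1          .540**       .548**
             Sig. (2-tailed)         .371               .000         .000
             N                       83      83         83           83
arithmetic   Pearson Correlation     …       .540**     1            .668**
             Sig. (2-tailed)         …       .000                    .000
             N                       83      83         83           83
algebra      Pearson Correlation     .169    .548**     .668**       1
             Sig. (2-tailed)         .128    .000       .000
             N                       83      83         83           83
** Correlation is significant at the 0.01 level (2-tailed).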





The (Pearson) correlation coefficient between each pair of variables is given in the 
table. In addition, SPSS has carried out a significance test of the null hypothesis 
of zero correlation, and the p value for the test and the number of observations 
from which each coefficient was calculated are given. For example, the correlation 
between the scores for geometry and algebra is 0.548, and the p value is reported 
as .000, which means that p < 0.0005. Correlations for which the p value is less 
than 0.01 are starred. Note that a p value is not calculated for the correlation 
coefficient between a variable and itself, because such coefficients necessarily take 
the value 1. 
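
A test of zero correlation is commonly based on the statistic
t = r √(n − 2) / √(1 − r²), which is referred to a t distribution with n − 2 degrees of
freedom. The following Python sketch evaluates this statistic for the correlation
between geometry and algebra. It assumes that this standard test is the one applied
here, and it takes n = 83 from the N rows of the output; the function name is
hypothetical.

```python
from math import sqrt


def correlation_t_statistic(r, n):
    """t statistic for testing zero correlation, with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)


# r = 0.548 for geometry and algebra; N = 83 is the sample size shown in the output.
t = correlation_t_statistic(0.548, 83)
print(round(t, 2))
```

A value of t this far from zero on 81 degrees of freedom corresponds to a very small
p value, which is consistent with the reported value of .000.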


Now obtain the covariance matrix, as follows. 


• Obtain the Bivariate Correlations dialogue box. The settings you have
just made will have been retained.
• Click on the Options... button. The Bivariate Correlations: Options
dialogue box will open.
• In the Statistics area, select Cross-product deviations and covariances.
• Click on Continue, then click on OK.




The following output table will be displayed in the Viewer window.

Correlations

                                                age        geometry     arithmetic   algebra
age          Pearson Correlation                1          .099         …            .169
             Sig. (2-tailed)                               .371         …            .128
             Sum of Squares and Cross-products  119.400    248.487      …            394.616
             Covariance                         1.456      3.030        …            4.812
             N                                  83         83           83           83
geometry     Pearson Correlation                .099       1            .540**       .548**
             Sig. (2-tailed)                    .371                    .000         .000
             Sum of Squares and Cross-products  248.487    52282.916    27114.651    26839.060
             Covariance                         3.030      637.597      330.666      327.306
             N                                  83         83           83           83
arithmetic   Pearson Correlation                …          .540**       1            .668**
             Sig. (2-tailed)                    …          .000                      .000
             Sum of Squares and Cross-products  …          27114.651    48292.410    31468.964
             Covariance                         …          330.666      588.932      383.768
             N                                  83         83           83           83
algebra      Pearson Correlation                .169       .548**       .668**       1
             Sig. (2-tailed)                    .128       .000         .000
             Sum of Squares and Cross-products  394.616    26839.060    31468.964    45896.386
             Covariance                         4.812      327.306      383.768      559.712
             N                                  83         83           83           83
** Correlation is significant at the 0.01 level (2-tailed).








The correlation coefficients you obtained previously are given in this table. In 
addition, the covariances are given in the rows labelled Covariance. For example, 
the covariance between the scores in geometry and algebra is 327.306. Note that 
the variances are also given in the Covariance rows. For example, the variance of 
the geometry scores is 637.597. 
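
The entries in the table are linked: a correlation is the covariance divided by the
product of the two standard deviations, and a sum of squares and cross-products is the
corresponding covariance multiplied by N − 1. The following Python sketch checks both
relationships using the geometry and algebra values quoted above; the function name is
hypothetical.

```python
from math import sqrt


def correlation_from_covariance(cov_xy, var_x, var_y):
    """Pearson correlation = covariance / (product of the standard deviations)."""
    return cov_xy / sqrt(var_x * var_y)


# Values read from the Covariance rows of the output table.
r = correlation_from_covariance(327.306, 637.597, 559.712)
print(round(r, 3))  # agrees with the tabulated correlation of .548

# Sum of squares and cross-products = covariance x (N - 1), with N = 83.
sscp = 327.306 * (83 - 1)
print(round(sscp, 1))
```

The cross-product computed in this way differs slightly from the tabulated 26839.060
because the covariance in the table has been rounded to three decimal places.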


Computer Activity 2.4 will give you some practice at obtaining and interpreting 
mean vectors, covariance matrices and correlation matrices. 


Computer Activity 2.4 The cost of crime in the United States 


In Computer Activity 1.4, you obtained a matrix scatterplot of data on the 
property and personal cost of crime, and the state and local expenditure on 
policing in the 50 states of the USA. The data are in the file crimeloss2.sav. 
Open the file now. 


(a) Obtain the mean vector for the variables proploss, persloss, stateexp and 
localexp. Comment briefly, where appropriate, on the relative magnitude of 
the elements of the mean vector. 


(b) Obtain the correlations and covariances between the four variables. Identify 
the two variables with the highest covariance, and the two variables with the 
highest correlation. Explain why these pairs are not the same. 


(c) Write down the lower triangle of the correlation matrix, and briefly interpret 
it. 


Summary of Chapter 2 


In this chapter, you have learned how to obtain mean vectors, correlation matrices 
and covariance matrices in SPSS. You have also learned how to obtain mean 
vectors for subgroups of data, and how to save standardized variables. 




Chapter 3 
Principal component analysis 


In this chapter, you will learn how to use SPSS to undertake a principal
component analysis. PCA is carried out using the Factor Analysis dialogue box,
which is obtained using Factor... from the Dimension Reduction submenu of
Analyze. PCA based on standardized data is illustrated in Computer
Activity 3.1. Factor analysis is a different technique from principal component
analysis, but SPSS does both under the heading ‘Factor Analysis’.

Factor analysis is not covered in
M249.


Computer Activity 3.1 Analysing standardized data 


The file mathsability.sav contains the mathematical ability data described in 
Computer Activity 2.3. Open the file now. In this activity, you will undertake a 
principal component analysis of the four variables age, geometry, arithmetic 
and algebra. By default, SPSS standardizes the variables and bases the PCA on 
the standardized variables. Carry out a principal component analysis based on 
the standardized variables, as follows. 


© Choose Factor... from the Dimension Reduction submenu of Analyze. 
The Factor Analysis dialogue box will open. 


© Enter the variables age, geometry, arithmetic and algebra in the 
Variables field. 


© Click on OK. 


Three output tables will be displayed in the Viewer window. The first table, the 
Communalities table, is reproduced below. 


Communalities

             Initial   Extraction
age            1.000
geometry       1.000
arithmetic     1.000
algebra        1.000

Extraction Method: Principal Component Analysis.





This table allows you to check that a principal component analysis has been done 
on the standardized data. The figures in the Initial column are the variances of 
the variables included in the analysis. Each variance is 1.000, as it should be for 
standardized data. Also notice that there is a message below the table stating 
that the ‘Extraction Method’ is Principal Component Analysis. (You should 
ignore the Extraction column.) 


The second table is the Total Variance Explained output table, shown below.


Total Variance Explained

            Initial Eigenvalues                      Extraction Sums of Squared Loadings
Component   Total   % of Variance   Cumulative %     Total   % of Variance   Cumulative %
1           2.193          54.827         54.827     2.193          54.827         54.827
2           1.001          25.019         79.845     1.001          25.019         79.845
3            .495          12.379         92.224
4            .311           7.776        100.000

Extraction Method: Principal Component Analysis.







This table is in two parts: Initial Eigenvalues and Extraction Sums of 
Squared Loadings. There are four rows, one for each principal component. There 
are four components because the data set is four-dimensional. In general, the 
number of components (and hence the number of rows) corresponds to the number 
of variables entered in the Variables field in the Factor Analysis dialogue box. 


Look first at the columns headed Initial Eigenvalues. The Total column gives 
the variance of each principal component. So, for example, the variance of the 
first principal component is 2.193. Note also that the sum of the entries in this 
column is 2.193 + 1.001 + 0.495 + 0.311 = 4.000, as expected since the analysis is 
based on standardized data. The remaining two columns, % of Variance and 
Cumulative %, give the percentage variance explained by each principal 
component, and the cumulative percentage variance explained. 
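
The arithmetic behind the % of Variance and Cumulative % columns can be sketched as follows, starting from the four variances quoted above. This is only an illustration, not SPSS's own calculation: because the inputs are rounded to three decimal places, the results may differ from the SPSS table in the last decimal place (for example, 54.825 rather than 54.827).

```python
# Variances of the four principal components, as quoted in the Total column.
variances = [2.193, 1.001, 0.495, 0.311]

total = sum(variances)                       # 4.000 for four standardized variables
pve = [100 * v / total for v in variances]   # percentage variance explained

# Cumulative percentage variance explained (CPVE).
cpve = []
running = 0.0
for p in pve:
    running += p
    cpve.append(running)

for k, (v, p, c) in enumerate(zip(variances, pve, cpve), start=1):
    print(f"{k}  {v:6.3f}  {p:7.3f}  {c:8.3f}")
```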


Now look at the second part of the table, headed Extraction Sums of Squared 
Loadings. The entries for components 1 and 2 in the first part of the table are 
duplicated, but those for components 3 and 4 are not. The completed rows 
correspond to 'extracted' components — that is, components that are deemed 
important enough to retain. By default, SPSS uses Kaiser's criterion. Thus only 
the first two components, which both have variances greater than 1, are shown. 
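
Kaiser's criterion, as used here, retains the components whose variance exceeds the average variance (which is 1 for standardized data). A minimal sketch, with a hypothetical function name:

```python
def kaiser_retained(variances):
    """Return the numbers of the components whose variance exceeds the average."""
    average = sum(variances) / len(variances)
    return [k for k, v in enumerate(variances, start=1) if v > average]

# Variances from the Total Variance Explained table for the standardized data.
standardized = [2.193, 1.001, 0.495, 0.311]
retained = kaiser_retained(standardized)
print(retained)  # [1, 2]
```

Note that the second component only just passes the threshold (1.001 against an average of 1), which is one reason to look at other evidence as well before deciding how many components to keep.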


The third table is the Component Matrix output table, shown below. 


Component Matrix^a


age 
geometry 
arithmetic 
algebra 


Extraction Method: Principal 
Component Analysis. 


a. 2 components extracted. 





This table gives the loadings associated with the two ‘extracted’ components. The 
first column gives the loadings associated with the first principal component. The 
second column gives the loadings associated with the second principal component. 


In Part II of Book 3, the following constraint was placed on the loadings
$\alpha_{k1}, \ldots, \alpha_{kp}$ associated with the $k$th principal component:

$$\sum_{j=1}^{p} \alpha_{kj}^2 = 1.$$

SPSS uses a different convention for the loadings given in the Component Matrix
output table. Let $\alpha^*_{k1}, \ldots, \alpha^*_{kp}$ denote the loadings quoted by SPSS for the $k$th
principal component $Y_k$. The convention used by SPSS is that the sum of the
squares of the loadings should equal the variance $V(Y_k)$. That is,

$$\sum_{j=1}^{p} (\alpha^*_{kj})^2 = V(Y_k).$$

To obtain the value $\alpha_{kj}$ defined in Book 3, the loading $\alpha^*_{kj}$ quoted by SPSS in the
Component Matrix table must be divided by the standard deviation of the
component, which is the square root of the variance given in the Total column of
the Total Variance Explained table. That is,

$$\alpha_{kj} = \alpha^*_{kj} / \sqrt{V(Y_k)}.$$

In this case, $V(Y_1)$, the variance of the first principal component, is 2.193. So
each loading associated with the first component must be divided by $\sqrt{2.193}$.


The terminology used by SPSS 
is different from that used in 
Book 3. You should focus on 
what the output means, not on 
what SPSS calls it. 


Kaiser’s criterion is discussed in 
Subsection 8.3 of Book 3. 




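The loadings vector, with the variables in the order geometry, arithmetic,
algebra and age, is therefore as follows:

$$
\begin{aligned}
(\alpha_{11}, \alpha_{12}, \alpha_{13}, \alpha_{14})
&= \left( \frac{\alpha^*_{11}}{\sqrt{V(Y_1)}}, \frac{\alpha^*_{12}}{\sqrt{V(Y_1)}}, \frac{\alpha^*_{13}}{\sqrt{V(Y_1)}}, \frac{\alpha^*_{14}}{\sqrt{V(Y_1)}} \right) \\
&= \left( \frac{0.807}{\sqrt{2.193}}, \frac{0.857}{\sqrt{2.193}}, \frac{0.878}{\sqrt{2.193}}, \frac{0.192}{\sqrt{2.193}} \right) \\
&\simeq (0.545, 0.579, 0.593, 0.130).
\end{aligned}
$$

The loadings $\alpha_{11} = 0.545$, $\alpha_{12} = 0.579$, $\alpha_{13} = 0.593$ and $\alpha_{14} = 0.130$ are those
given in Example 7.10 of Book 3.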
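
The conversion from the loadings quoted by SPSS to loadings satisfying the Book 3 constraint can be sketched in a few lines. The function name is hypothetical; the inputs are the first-component values quoted above, in the order geometry, arithmetic, algebra, age.

```python
from math import sqrt

def book3_loadings(spss_loadings, component_variance):
    """Divide each SPSS loading by the standard deviation of the component."""
    sd = sqrt(component_variance)
    return [a / sd for a in spss_loadings]

spss_pc1 = [0.807, 0.857, 0.878, 0.192]   # SPSS loadings, first component
loadings = book3_loadings(spss_pc1, 2.193)

print(", ".join(f"{a:.3f}" for a in loadings))  # 0.545, 0.579, 0.593, 0.130

# The rescaled loadings should satisfy the constraint (up to rounding error).
sum_of_squares = sum(a ** 2 for a in loadings)
```

Because the inputs are rounded to three decimal places, the sum of squares is close to, but not exactly, 1.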





Computer Activity 3.2 Trace chemicals in water from wells 


Data on the concentrations of 25 trace chemicals in 41 samples of well water in 
Nevada, USA, are described in Example 2.4 of Book 3. The samples were taken at 
different locations and depths. The aim is to gain an understanding of the 
underground water system: if two samples of water from different locations have 
different compositions, it is less likely that water can leach from one location to 
the other than if the two samples have a similar composition. 


The data are in the file wells.sav. Open the file now. There are 26 variables. The 
variable sample identifies the 41 observations. The other 25 variables denote the 
log concentrations of the 25 trace elements. 


(a) Obtain the mean and standard deviation of the log concentration of each of 
the 25 trace elements. Which variable has the largest standard deviation? 
Which variable has the smallest standard deviation? Explain why it is 
sensible to base a principal component analysis of these 25 variables on 
standardized data. 


(b) Carry out the PCA using standardized data. How many components are
retained using Kaiser's criterion?

Analyze > Dimension Reduction > Factor...


(c) Obtain the percentage variance explained by each retained component. What 
is the CPVE by the first two components? What is the CPVE by all retained 
components? 


(d) Write down the loadings $\alpha^*_{11}$ and $\alpha^*_{21}$ calculated by SPSS for the variable
logLi for the first and second principal components. Hence obtain the
corresponding loadings $\alpha_{11}$ and $\alpha_{21}$ for logLi under the constraint used in
Book 3.


As you have seen, by default SPSS produces a principal component analysis based 
on standardized data. This is sensible given that such an analysis is often more 
appropriate than an analysis based on the unstandardized data. However, 
sometimes an analysis of the unstandardized data is required. Carrying out a 
principal component analysis based on unstandardized data is illustrated in 
Computer Activity 3.3. 
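
The difference between the two kinds of analysis can be sketched for just two variables, where the variances of the principal components (the eigenvalues of a 2 x 2 symmetric matrix) have a closed form. The data below are made up and the function names are hypothetical. Analysing the covariance matrix corresponds to unstandardized data; analysing the correlation matrix corresponds to standardized data.

```python
from math import sqrt

def cov_matrix_2d(x, y):
    """Sample variances and covariance (divisor n - 1) of two variables."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    return sxx, sxy, syy

def eigenvalues_2x2(sxx, sxy, syy):
    """Eigenvalues of [[sxx, sxy], [sxy, syy]], largest first."""
    centre = (sxx + syy) / 2
    radius = sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    return centre + radius, centre - radius

# Made-up data: two variables measured on very different scales.
x = [52.0, 61.0, 58.0, 70.0, 66.0, 75.0]
y = [5.1, 5.9, 6.4, 6.8, 7.5, 7.9]

sxx, sxy, syy = cov_matrix_2d(x, y)
raw_variances = eigenvalues_2x2(sxx, sxy, syy)     # PCA on unstandardized data

r = sxy / sqrt(sxx * syy)                          # correlation of x and y
std_variances = eigenvalues_2x2(1.0, r, 1.0)       # PCA on standardized data

print("unstandardized:", raw_variances, "total", sxx + syy)
print("standardized:  ", std_variances, "total", 2.0)
```

In the unstandardized analysis, the component variances sum to the total of the original variances, so the variable with the larger variance dominates. In the standardized analysis, they sum to the number of variables.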


Computer Activity 3.3 Analysing unstandardized data 


A data set on the performance of local education authorities was described in 
Computer Activity 1.3. The data consist of the percentages of eleven-year-old
primary school students passing standard tests in the three subjects English, 
Mathematics and Science. Each of these three variables is therefore measured on 
the same scale. So, for these three variables, standardization is not essential prior 
to obtaining principal components. 




The data are in the file lea.sav. Open the file now. Obtain a principal component
analysis based on the three unstandardized variables english, maths and
science, as follows.

© Obtain the Factor Analysis dialogue box.

© Enter english, maths and science in the Variables field.

© Click on the Extraction... button. The Factor Analysis: Extraction
dialogue box will open.

© In the Analyze area, select Covariance matrix.

© Click on Continue, then click on OK.

As in Computer Activity 3.1, three output tables will be displayed in the Viewer
window. However, each table is twice as large as when analysing the standardized
data, and comprises two sub-tables, labelled Raw and Rescaled. You should
ignore the parts of the three tables labelled Rescaled, and concentrate on the Raw
part of each table. The Raw part of each output table is interpreted in the same
way as for the analysis based on standardized data.


The Communalities output table is reproduced below.


Communalities

           Raw                    Rescaled
           Initial   Extraction   Initial   Extraction
english     26.218       22.934     1.000         .875
maths       19.776                  1.000         .858
science                  10.050     1.000         .851

Extraction Method: Principal Component Analysis.





The variances of the three variables english, maths and science are given in the 
Initial column of the Raw sub-table. For example, the variance of english is 
26.218. 


The Total Variance Explained output table is as follows. 


Total Variance Explained

                        Initial Eigenvalues^a                    Extraction Sums of Squared Loadings
           Component    Total    % of Variance   Cumulative %    Total    % of Variance   Cumulative %
Raw        1           49.948           86.405         86.405   49.948           86.405         86.405
           2            6.155           10.648         97.053
           3            1.703            2.947        100.000
Rescaled   1           49.948           86.405         86.405                    86.111         86.111
           2            6.155           10.648         97.053
           3            1.703            2.947        100.000

Extraction Method: Principal Component Analysis.
a. When analyzing a covariance matrix, the initial eigenvalues are the same across the raw and rescaled solution.


The interpretation of the Raw sub-table is the same as for output from PCA based 
on standardized data. The total variance explained is 


TVE = 49.948 + 6.155 + 1.703 = 57.806. 





Hence the percentage variance explained by the first principal component is

$$\frac{49.948}{57.806} \times 100\% \simeq 86.406\%.$$





The average variance of the three principal components is $57.806/3 \simeq 19.269$.
SPSS applies Kaiser’s criterion, so it retains only those components with variance
greater than the average. Thus, based on this criterion, only the first principal
component is retained.


The SPSS loadings for the first principal component are given in the Raw part of
the Component Matrix output table. The loadings are $\alpha^*_{11} = 4.789$, $\alpha^*_{12} = 4.119$
and $\alpha^*_{13} = 3.170$. Dividing $\alpha^*_{11}$, $\alpha^*_{12}$ and $\alpha^*_{13}$ by $\sqrt{49.948}$, the square root of the
variance of the first principal component, gives $\alpha_{11} \simeq 0.678$, $\alpha_{12} \simeq 0.583$ and
$\alpha_{13} \simeq 0.449$. These are the loadings given in Example 7.3 of Book 3.


Analyze > Dimension 
Reduction > Factor... 


The default setting is 
Correlation matrix, which 
specifies the analysis based on 
standardized data. 


The slight difference between 
this value and the quoted value 
of 86.405% is due to rounding 
error. 




Computer Activity 3.4 A more detailed analysis of mathematical 
ability 


Data on the mathematical ability of schoolboys were analysed in Computer 
Activity 3.1. The mathematical ability of the schoolboys was actually measured 
using nine scores, A, B, ..., I. The scores A, B, C and D measured different 
aspects of geometrical ability, scores E, F and G measured different aspects of 
arithmetical ability, and scores H and I measured different aspects of algebraical 
ability. These more detailed data are in the file mathsability2.sav. Open this 
file now. The first six variables are the same as in the file mathsability.sav. 
There are a further nine variables giving the scores A, B, ..., I.


(a) Obtain the principal component analysis of the scores A, B, ..., I based on 
the standardized data. 


(i) How many components are retained? Write down the cumulative 
percentage variance explained by the retained components. 


(ii) Use the loadings calculated by SPSS (and the description of the scores 
provided above) to interpret these components. (Hint: use the signs of 
the loadings.) 


(b) Obtain the principal component analysis of the scores A, B, ..., I based on 
the unstandardized data. 


(i) How many components are retained? Write down the cumulative 
percentage variance explained by the retained components. Compare 
your results with those you obtained in part (a)(i). 


(ii) Identify the three scores with the largest loadings (in absolute value) for 
the first principal component. Comment on these, in the light of the 
variances of the scores. 


(iii) Use the loadings calculated by SPSS to interpret the retained 
components. 


(c) Briefly discuss whether the analysis based on standardized data or the 
analysis based on unstandardized data is preferable. 


Summary of Chapter 3 


In this chapter, you have learned how to carry out a principal component analysis 
in SPSS with standardized data and with unstandardized data. The 
interpretation of the loadings produced by SPSS has been discussed. You have 
learned how to calculate the loadings under the constraint used in Book 3 from 
the loadings given by SPSS. 




See Example 2.3 of Book 3. 


Chapter 4 
Extracting and plotting principal components 


In Chapter 3, you saw that, by default, SPSS uses Kaiser's criterion to select the 
number of components to be retained. Although information is provided on the 
variance, the percentage variance and the cumulative percentage variance 
explained for all of the p components that may be obtained for a p-dimensional 
data set, the loadings are displayed only for the extracted components. 


In this chapter, you will learn how to obtain a scree plot in order to decide for 
yourself how many components to retain. You will also learn how to extract these 
components, save them, and plot them. 


In Computer Activity 3.4, you carried out principal component analyses based on 
standardized data and on unstandardized data for the mathematical ability data 
in the file mathsability2.sav. You will use this data file in Computer 

Activities 4.1, 4.2, 4.3 and 4.4. If possible, you should do these four activities in 
one session. 


Computer Activity 4.1 Obtaining a scree plot


In this activity, you will obtain a scree plot for the mathematical ability data.


Open the file mathsability2.sav now (if it is not already open). Obtain a scree 
plot for the principal components based on standardized data, as follows. 


© Obtain the Factor Analysis dialogue box.

© Click on Reset to ensure that the SPSS default settings are used.

© Enter the variables A, B, ..., I in the Variables field.

© Click on the Extraction... button. The Factor Analysis: Extraction
dialogue box will open.

© In the Display area, select Scree plot. Leave all other options unchanged.

© Click on Continue, then click on OK.

Analyze > Dimension Reduction > Factor...


In addition to three output tables, the scree plot shown in Figure 4.1 will be 
displayed in the Viewer window. 


[Figure: scree plot of Eigenvalue (vertical axis, 0 to 5) against Component Number (horizontal axis, 1 to 9)]


Figure 4.1 Scree plot for the mathematical ability data


The vertical axis of the scree plot is labelled Eigenvalue: this is the term used by
SPSS to describe the variance of a principal component. The scree plot shows
that the variance drops sharply between components 1 and 2.
After that, the variances decline slowly but steadily. The elbow in this plot occurs
at component 2. This suggests that the most appropriate number of components
to retain for these data is one. (Two components were retained in part (a) of
Computer Activity 3.4 using Kaiser's criterion.)
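
The idea behind reading a scree plot can be sketched without any plotting software: list the component variances in order and look at the drop from each one to the next. The nine variances for the mathematical ability scores are not reproduced in this computer book, so the sketch below uses the four variances from Computer Activity 3.1 instead; the function names are hypothetical.

```python
def successive_drops(variances):
    """Decrease in variance from each component to the next."""
    return [a - b for a, b in zip(variances, variances[1:])]

def text_scree(variances, width=40):
    """A crude text version of a scree plot: one bar per component."""
    top = max(variances)
    lines = []
    for k, v in enumerate(variances, start=1):
        bar = "#" * round(width * v / top)
        lines.append(f"{k:>2} {v:6.3f} {bar}")
    return "\n".join(lines)

# Variances of the four components from Computer Activity 3.1.
variances = [2.193, 1.001, 0.495, 0.311]
print(text_scree(variances))

drops = successive_drops(variances)
print([round(d, 3) for d in drops])
```

Here the largest drop is between components 1 and 2, after which the variances level off. Deciding where the elbow lies remains a matter of judgement, not something the calculation settles.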


In Computer Activity 3.4, you saw that the first two principal components based on
the standardized data have a simple interpretation. However, it is a good idea to
look at the loadings for one or more additional components, or all the 
components, to see whether any of them also have intuitive interpretations. To do 
this, you may need to override the SPSS default settings, since with these settings 
SPSS uses Kaiser's criterion to determine the number of components to retain. 


Computer Activity 4.2 Specifying the number of components to be 
retained 


You should still have the data file mathsability2.sav open. Specify the number 
of components to be retained in a principal component analysis of the data, as 
follows. 


© Obtain the Factor Analysis dialogue box. The settings used in Computer 
Activity 4.1 will have been retained. 


© Check that the variables A to I are in the Variables field. 


© Click on Extraction... to open the Factor Analysis: Extraction 
dialogue box. 


© Deselect Scree plot.


© In the Extract area, select Fixed number of factors and enter 3 in the
Factors to extract field. This ensures that the loadings for the first three
components will be displayed in the Component Matrix table.


Since the dimension of the data is 9, you could have specified any number up to 9. 
Note that if you request more principal components than there are variables, 
SPSS will revert to using Kaiser’s criterion for determining the number of 
components to retain. 


© Click on Continue, then click on OK. 


In the Viewer window, notice that the Extraction Sums of Squared Loadings 
part of the Total Variance Explained table contains entries for three 
components, with a CPVE of 82.764%. The loadings for the first three 
components are displayed in the Component Matrix table, which is reproduced 
below. 


Component Matrix^a

     Component
     1     2     3
A
B
C
D
E
F
G
H
I

Extraction Method: Principal Component Analysis.
a. 3 components extracted.




Do not close the file mathsability2.sav as you will need it in Computer
Activity 4.2.




The loadings for the third principal component are small in absolute value for 
scores A, B, C and D, which correspond to geometrical ability. The scores E, F 
and G, which correspond to arithmetical ability, have positive loadings. The
scores H and I, which relate to algebraical ability, have negative loadings. Thus the
third component contrasts algebraical and arithmetical ability.


You will need the settings used in this activity in Computer Activity 4.3, so if
possible you should do that activity now.


The purpose of principal component analysis is to reduce the dimension of the 
data so as to explore its structure more easily. One way to do this is to plot the 
first two components in a scatterplot (or the first few components in a matrix 
scatterplot). But first, the principal components must be calculated and saved. 
Instructions for doing this are given in Computer Activity 4.3. 


Computer Activity 4.3 Extracting principal components 


In this activity you will use SPSS to calculate the first three principal components 
for the mathematical ability data set, and plot them. The file mathsability2.sav 
should still be open. 


© Obtain the Factor Analysis dialogue box. 


If you are continuing directly from Computer Activity 4.2, the settings used there 
will have been retained; these are for an analysis of the standardized scores A to I 
with the first three principal components being extracted. If necessary, re-enter 
these settings. 


© Click on the Scores... button to open the Factor Analysis: Factor 
Scores dialogue box. 


© Check Save as variables. The Method area will become active. For
principal component analysis, it does not matter which method is selected:
the principal components will be the same. So leave the default setting.


© Click on Continue, then click on OK. 


The output displayed in the Viewer window will be the same as in Computer 
Activity 4.2. Look at the Data View panel of the Data Editor. Notice that 
three new variables have been added to the data file, named FAC1_1, FAC2_1 and
FAC3_1. These correspond to the first three principal components. The
correspondence is not exact: the principal components produced by SPSS are 
scaled so they have variance 1. This scaling does not affect the interpretation of 
principal component scatterplots and is ignored in M249. 
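
The effect of this scaling can be sketched as follows. The scores below are made up, and the function names are hypothetical; the sketch assumes the sample variance with divisor n - 1. Dividing every score by the component's standard deviation gives values with variance 1 but leaves the ordering and relative spacing of the points unchanged, which is why scatterplots are interpreted in the same way.

```python
from math import sqrt

def sample_variance(values):
    """Sample variance with divisor n - 1 (assumed convention)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def rescale_to_unit_variance(scores):
    """Divide each score by the standard deviation of the scores."""
    sd = sqrt(sample_variance(scores))
    return [s / sd for s in scores]

# Made-up principal component scores with mean zero, for illustration only.
pc_scores = [-2.1, -0.4, 0.3, 1.5, 0.7, -1.2, 1.2]
saved = rescale_to_unit_variance(pc_scores)

print(round(sample_variance(pc_scores), 3), round(sample_variance(saved), 3))

# The ranking of the observations is unchanged by the rescaling.
rank_before = sorted(range(len(pc_scores)), key=lambda i: pc_scores[i])
rank_after = sorted(range(len(saved)), key=lambda i: saved[i])
```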


In the Variable View panel, change the names of the new variables to pc1, pc2 
and pc3, and delete their labels. Also, change the number of decimal places 
displayed from 5 to 2. 


Now obtain a matrix scatterplot of the three principal components, such as that 
shown in Figure 4.2. 


You may need to use the scroll 
bar to see the new variables. 


[Figure: matrix scatterplot of the variables pc1, pc2 and pc3; one point is labelled 28]


Figure 4.2 Matrix scatterplot of the first three principal components for the 
mathematical ability data 


This matrix scatterplot displays nearly 83% of the total variance present in the
original data set of dimension 9. It is easier to identify unusual points from this
scatterplot than it would be from a matrix scatterplot of the nine variables A, B,
..., I. For example, one boy (labelled in Figure 4.2) scores unusually highly on
the third principal component.


Do not close the file mathsability2.sav as you will need it for Computer 
Activity 4.4. There is no need to save your work if you are unable to continue 
directly to Computer Activity 4.4, as the file mathsability3.sav contains all the 
variables from mathsability2.sav, together with the principal components pc1, 
pc2 and pc3. 


Computer Activity 4.4 Plotting principal components 


In order to explore the structure of a data set it is often useful to focus on the 
first two principal components. You will do this for the mathematical ability data 
in this activity. 


You should still have the file mathsability2.sav open. If not, then open the file 
mathsability3.sav. The variable form indicates the form that each boy belonged 
to. 


(a) Obtain a scatterplot of the first two principal components, with pc1 on the 
x-axis and pc2 on the y-axis, using different colours or plotting symbols for 
each form. 


(b) Use the scatterplot you produced in part (a) to answer the following 
questions. 


(i) Do there appear to be differences in the value of the first principal
component between the forms?

(ii) Do there appear to be differences in the value of the second principal
component between the forms?


(c) Do your answers to part (b) make sense in light of your interpretation of the
first two principal components (based on standardized data) in Computer
Activity 3.4?




Computer Activity 4.5 Identifying groups 


This activity will give you some practice at extracting, saving and plotting 
principal components. 


In Computer Activity 3.2, you analysed data on the log concentrations of
25 elements in 41 samples of groundwater in Nevada. The aim of the study is to
gain an understanding of the groundwater system. If two samples of water from 
different locations have different chemical compositions, it is less likely that water 
can leach from one site to the other than if the samples have similar compositions. 
Thus it is of interest to identify groups of samples, and samples that appear 
different in some way. In this activity, you will undertake further analysis of these 
data. The data are in the file wells.sav. Open the file now. 


(a) Obtain a scree plot of the principal components based on the standardized 
log concentrations of the 25 elements. Use this scree plot to decide how many 
principal components to retain. 


(b) Save the components you identified as important in part (a), and tidy up the 
data file as described in Computer Activity 4.3. What is the cumulative 
percentage variance explained by these components? 


(c) Obtain a matrix scatterplot of the components you saved in part (b). Does it 
appear that the samples fall into distinct groups, or that some samples are 
unusual? 


(d) Obtain a scatterplot of the first two principal components. How many 
distinct groups of samples can you see? Identify these groups by labelling one 
point per group using the variable sample. 


Summary of Chapter 4 


In this chapter, you have learned how to obtain a scree plot in SPSS. You have 
also learned how to specify the number of principal components to extract, and 
how to save these components. You have seen how the principal components can 
be displayed using scatterplots. 


Chapter 5 
Canonical discriminant analysis 


In this chapter, you will use SPSS to obtain discriminant functions to separate 
known groups in data. Canonical discriminant analysis is done using 
Discriminant... from the Classify submenu of Analyze. The Swiss banknote 
data described in Example 10.1 of Book 3 will be used to illustrate the method. 
The data are in the file banknotes.sav. You will need this file for Computer 
Activities 5.1 and 5.2. 




Computer Activity 5.1 Obtaining a discriminant function 


Open the file banknotes.sav. There are eight variables: note numbers the 
individual notes from 1 to 200; type codes them as Genuine or Counterfeit; and 
the six measurements are in length, lwidth, rwidth, bottom, top and diagonal. 
The grouping variable is type, which is coded 1 for Genuine and 2 for Counterfeit. 


In discriminant analysis, SPSS uses the constraint $V_w(D) = 1$ and produces
loadings based on group-standardized variables, by default. Obtain the first 
discriminant function for the Swiss banknote data, as follows. 


© Choose Discriminant... from the Classify submenu of Analyze. The 
Discriminant Analysis dialogue box will open. 


© Enter type in the Grouping Variable field. Once entered, the variable 
type is initially displayed as type(? ?) because it is necessary to specify the 
range of numbers used to code the groups. 


© Click on the Define Range... button. The Discriminant Analysis: 
Define Range dialogue box will open. 


© The genuine notes are coded 1 and counterfeit notes are coded 2, so type 1 in 
the Minimum field and 2 in the Maximum field. 


© Click on Continue. Notice that the entry in the Grouping Variable field 
has changed to type(1 2). 


© Enter the variables length, lwidth, rwidth, bottom, top and diagonal in 
the Independents field. Leave other settings unchanged. 


© Click on OK. 


Seven tables will be displayed in the Viewer window. Most of these tables can be 
ignored. Scroll through the output until you reach the Standardized Canonical 
Discriminant Function Coefficients table, which is reproduced below. 


Standardized Canonical Discriminant Function Coefficients

            Function
                   1
length         -.002
lwidth         -.262
rwidth          .278
bottom         1.028
top             .757
diagonal       -.787





This table contains the loadings for the first discriminant function $D$, based on
the group-standardized variables. For these data, the loadings are $a_1 = -0.002$,
$a_2 = -0.262$, $a_3 = 0.278$, $a_4 = 1.028$, $a_5 = 0.757$, $a_6 = -0.787$, and hence the first
discriminant function based on group-standardized data is

$$D = -0.002 Z_1 - 0.262 Z_2 + 0.278 Z_3 + 1.028 Z_4 + 0.757 Z_5 - 0.787 Z_6.$$




Scroll up through the output and you will come to the Eigenvalues output table, 
as shown below. 


Eigenvalues

Function   Eigenvalue   % of Variance   Cumulative %   Canonical Correlation
1           12.184^a            100.0          100.0                    .961

a. First 1 canonical discriminant functions were used in the analysis.


The quantity that SPSS calls the eigenvalue is the separation for $D$,
$\mathrm{Sep}(D) = V_b(D)/V_w(D)$. In this case, the separation is equal to $V_b(D)$ since
$V_w(D) = 1$. The separation achieved using the first discriminant function is
therefore 12.184.
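
The Eigenvalues table also reports a canonical correlation, which is not discussed in this computer book. As a hedged aside, a standard identity relates it to the eigenvalue: the canonical correlation is the square root of eigenvalue/(1 + eigenvalue). The sketch below, with a hypothetical function name, checks that this reproduces the value .961 shown in the table.

```python
from math import sqrt

def canonical_correlation(separation):
    """Canonical correlation implied by an eigenvalue (separation).

    Uses the standard identity r = sqrt(lambda / (1 + lambda)); this relation
    is an addition for illustration and is not taken from the computer book.
    """
    return sqrt(separation / (1 + separation))

r = canonical_correlation(12.184)
print(f"{r:.3f}")  # 0.961
```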


Do not close the file banknotes.sav, as you will need it in Computer Activity 5.2. 


Computer Activity 5.2 Obtaining a discriminant function in terms of 
unstandardized variables 


In some situations — for example, for the purpose of allocating new observations 
to groups — discriminant functions based on variables that have not been 
group-standardized may be required. Obtain the discriminant function for the 
Swiss banknote data in banknotes.sav in terms of the unstandardized variables, 
as follows. 


© Obtain the Discriminant Analysis dialogue box. 


If you are continuing directly from Computer Activity 5.1, the variables will 
already be entered. If not, then enter them as described in Computer Activity 5.1. 


© Click on Statistics... to open the Discriminant Analysis: Statistics
dialogue box.


© In the Function Coefficients area, select Unstandardized. This ensures
that the loadings are calculated for unstandardized variables.


© Click on Continue, then click on OK. 


Most of the output that will be displayed in the Viewer window is the same as 
that obtained in Computer Activity 5.1. However, one extra table is produced. 
This is the Canonical Discriminant Function Coefficients output table, 
which is as follows. 


Canonical Discriminant Function Coefficients

              Function
                     1
length           -.005
lwidth           -.832
rwidth            .849
bottom           1.117
top              1.179
diagonal        -1.557
(Constant)     194.649

Unstandardized coefficients





Analyze > Classify > 
Discriminant... 




This output table contains the loadings of the discriminant function in terms of 
the unstandardized variables. For these data, the discriminant function is 


$$
\begin{aligned}
D = {} & -0.005\,\text{length} - 0.832\,\text{lwidth} + 0.849\,\text{rwidth} \\
& + 1.117\,\text{bottom} + 1.179\,\text{top} - 1.557\,\text{diagonal} + 194.649.
\end{aligned}
$$


The separation achieved is unchanged, and is equal to 12.184. 
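
Once the unstandardized loadings are known, the value of the discriminant function for any note can be calculated directly from its six measurements. The sketch below uses the coefficients from the table above; the function name and the measurements of the example note are hypothetical and are not taken from banknotes.sav.

```python
# Unstandardized loadings and constant from the SPSS output table.
COEFFICIENTS = {
    "length": -0.005,
    "lwidth": -0.832,
    "rwidth": 0.849,
    "bottom": 1.117,
    "top": 1.179,
    "diagonal": -1.557,
}
CONSTANT = 194.649

def discriminant_score(note):
    """Value of D for a note given as a dict of its six measurements."""
    return CONSTANT + sum(COEFFICIENTS[name] * note[name] for name in COEFFICIENTS)

# Hypothetical measurements, for illustration only.
example_note = {
    "length": 214.9, "lwidth": 130.1, "rwidth": 129.9,
    "bottom": 9.0, "top": 10.5, "diagonal": 141.0,
}
score = discriminant_score(example_note)
print(round(score, 3))
```

Because the quoted coefficients are rounded to three decimal places, a score calculated in this way may differ slightly from the value saved by SPSS.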


For the Swiss banknote data there are two groups, and hence only one 
discriminant function can be found. For many data sets, more than one 
discriminant function can be found. In general, the Standardized Canonical 
Discriminant Function Coefficients output table will have one column for 
each discriminant function, and the Eigenvalues table will have one row for each 
discriminant function. This is illustrated in Computer Activity 5.3. 


Computer Activity 5.3 Satellite imaging of soil type


A data set relating to images of the ground taken from a satellite is described in
Example 10.7 of Book 3. These data form part of a larger data set comprising
pixels classified in six categories coded as follows:


1: red soil;
2: cotton crop;
3: grey soil;
4: damp grey soil;
5: stubble;
6: very damp grey soil.


The variables are the readings made in four spectral bands, A, B, C and D. The 
data are in the file satellite.sav. Open the file now. There are five variables: 
group identifies the soil type, and bandA, bandB, bandC and bandD contain the 
readings made in the four spectral bands. 


(a) Carry out a canonical discriminant analysis, using the group-standardized 
variables obtained from bandA, bandB, bandC and bandD. How many 
discriminant functions have been obtained? Write down the separation 
achieved by each discriminant function. 


(b) Use the separations you obtained in part (a) to calculate the percentage 
separation achieved by the first discriminant function and the cumulative 
percentage separation achieved by the first two discriminant functions. 
Comment briefly on the contributions of subsequent discriminant functions. 
Where in the Eigenvalues output table are the percentage separation 
achieved and the cumulative percentage separation achieved reported? 


(c) Obtain the loadings for the first two discriminant functions based on 
group-standardized variables, and interpret them. 


(d) Write down expressions for $D_1$ and $D_2$, the first two discriminant functions,
based on the unstandardized variables.


Computer Activity 5.4 Extracting and plotting discriminant 
functions 


In this activity, you will learn how to extract and save the values of discriminant 
functions, and how to obtain histograms of the values corresponding to the 
different groups. The Swiss banknote data will be used to illustrate the method. 
Open the file banknotes.sav or activate it (if it is already open). 


© Obtain the Discriminant Analysis dialogue box and click on Reset. 


© Enter type in the Grouping Variable field, specify the codes, and enter the 
six measurement variables in the Independents field. 




Analyze > Classify > 
Discriminant... 




© Click on Save.... The Discriminant Analysis: Save dialogue box will 
open. 


© Select Discriminant scores. 
© Click on Continue, then click on OK. 


Look at the Data View panel of the Data Editor. Notice that an extra column,
Dis1_1, has been added. This column contains the values of the (first)
discriminant function. In the Variable View panel, edit the new variable, as
follows. Change its name from Dis1_1 to disc1, delete its label and reduce the
number of decimal places displayed from 5 to 2. This helps reduce clutter in the
output. Now obtain histograms of the values of the discriminant function for
genuine notes and counterfeit notes, as follows.


© Obtain the Histogram dialogue box. 

© Enter disc1 in the Variable field.

© In the Panel by area, enter type in the Rows field.
© Click on OK. 


The histograms that will be displayed in the Viewer window are shown in 
Figure 5.1. 





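[Figure: stacked histograms of the discriminant function values, with Frequency on the vertical axis and one panel each for genuine and counterfeit notes.]

Figure 5.1 Histograms of the first discriminant function for the banknote data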


Note that the two histograms are drawn using the same scale, thus allowing you 
to assess visually the degree of separation between the groups. These histograms 
are said to be stacked. Scroll up through the output to the Functions at Group 
Centroids table. (This is reproduced in the margin.) This table gives the mean 
value of the discriminant function for each group. Looking at the histograms, it is 
clear that these two values, −3.473 and +3.473, plausibly correspond to the
means of the two groups. Notice that all but one of the bars representing genuine 
notes lie below zero. Furthermore, the histogram representing the counterfeit 
notes is entirely located above zero. This suggests that the discriminant function 
has successfully separated the counterfeit notes from the genuine notes. 


You will need the variable disc1 in Chapter 6, so the data and the variable disc1
have been saved in the file banknotes2.sav.


If you repeat the procedure, an
identical column named Dis1_2
will be added to the file.


Graphs > Legacy Dialogs > 
Histogram... 


Functions at Group Centroids

type           Function 1
Genuine        −3.473
Counterfeit     3.473

Unstandardized canonical discriminant functions evaluated at group means







Computer Activity 5.5 Displaying the discriminant functions for the 
satellite data 


In Computer Activity 5.3, you obtained the four discriminant functions for 
separating six groups of pixels corresponding to different soil types. The data are 
in the file satellite.sav. Open this file now or activate it (if it is already open). 


(a) Obtain the four discriminant functions in terms of the group-standardized
variables, and save their values in variables named disc1, disc2, disc3 and
disc4. Delete the variable labels, and reduce the number of decimal places
displayed from 5 to 2.


(b) Obtain stacked histograms for the first two discriminant functions. Which 
groups are best separated by the two discriminant functions, and which 
groups are least well separated? 


(c) Obtain a scatterplot of the first two discriminant functions, with the six 
groups identified using different colours or plotting symbols. Briefly describe 
the location of the six groups on the scatterplot. 


Summary of Chapter 5 


In this chapter, you have learned how to carry out a canonical discriminant 
analysis in SPSS. You have obtained discriminant functions based on 
group-standardized and on unstandardized data, and the separation and 
percentage separation achieved by each discriminant function. You have also 
learned how to extract and save values of the discriminant functions and how to 
produce stacked histograms of the values for the groups. 


Chapter 6 
Allocation 


In this chapter, you will learn how to specify an allocation rule in SPSS, and how 
to evaluate it using the confusion matrix for the training set. You will also learn 
how to apply an allocation rule to a test set. An allocation rule can be specified 
and applied to a training set or a test set using Recode into Different 
Variables from the Transform menu. The confusion matrix can be obtained 
from a crosstabulation of actual and allocated groups. This is illustrated in 
Computer Activities 6.1 and 6.2 for the Swiss banknote data. 




If the file was already open, use 
Reset to cancel previous 
settings. 




Computer Activity 6.1 Allocating observations to groups 


In this activity, you will classify the Swiss 1000-franc banknotes using the values
of the discriminant function disc1. The variable disc1 is saved in the file
banknotes2.sav. Open this file now.


In Computer Activity 5.4, you found that disc1 has mean −3.473 for genuine
notes and 3.473 for counterfeit notes. So the appropriate cutpoint for the
allocation rule is (−3.473 + 3.473)/2 = 0. Thus, writing d for the value of
disc1, the allocation rule is as follows:


if d ≤ 0 classify as genuine,
otherwise classify as counterfeit.


Create a new variable called allocated, containing the group membership codes
determined by this allocation rule, by recoding the variable disc1, using the
code 1 for values less than or equal to 0, and the code 2 for values greater than 0
(or greater than or equal to 0.001, say).
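
The recoding step can also be sketched outside SPSS. The short Python fragment below is only an illustration of the rule, not part of the module software: the scores in it are hypothetical values chosen for the example (they are not taken from banknotes2.sav), and the function name allocate is an assumption made here.

```python
# Group means of the discriminant function reported in Computer Activity 5.4.
mean_genuine, mean_counterfeit = -3.473, 3.473

# Cutpoint: the midpoint of the two group means.
cutpoint = (mean_genuine + mean_counterfeit) / 2


def allocate(d, cutpoint=cutpoint):
    """Return code 1 (genuine) if d <= cutpoint, otherwise code 2 (counterfeit)."""
    return 1 if d <= cutpoint else 2


# Hypothetical discriminant scores, for illustration only.
scores = [-4.2, -3.1, -0.4, 0.0, 0.3, 2.9, 3.8]
allocated = [allocate(d) for d in scores]
print(cutpoint, allocated)
```

A score exactly equal to the cutpoint is given code 1, which matches the recoding described above.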


The new variable allocated will be added to the data file. In the Variable 
View panel of the Data Editor, assign value labels, as follows. 


© Activate the cell in the Values column corresponding to the variable
allocated. A blue box will appear in the cell.

© Click on this blue box and the Value Labels dialogue box will open.

© Type 1 in the Value field.

© Type Genuine in the Label field, and click on Add.

© Type 2 in the Value field.

© Type Counterfeit in the Label field, and click on Add.

© Click on OK to close the Value Labels dialogue box.


The variable allocated contains the allocated groups, labelled Genuine and 
Counterfeit. Obtain a frequency table of the variable allocated, as follows. 


© Choose Frequencies... from the Descriptive Statistics submenu of 
Analyze. 


© Enter allocated in the Variable(s) field. 
© Click on OK. 
The following table will be displayed in the Viewer window. 


allocated

                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid   Genuine           99        49.5        49.5               49.5
        Counterfeit      101        50.5        50.5              100.0
        Total            200       100.0       100.0


The table shows that the rule allocates 99 notes to the Genuine group and 101 to
the Counterfeit group.


The data file including the variable allocated has been saved as 
banknotes3.sav, so if you are not continuing directly to Computer Activity 6.2, 
you need not save your data file. 


Recoding variables is described 
in the Introduction to statistical 
modelling. Use Transform > 
Recode into Different 
Variables... 




Computer Activity 6.2  Misclassification rates and confusion 
matrices 


In this activity, you will construct a contingency table of actual group against 
allocated group for the banknote data by obtaining a crosstabulation of actual 
and allocated groups. This can be used to calculate the misclassification rate for 
an allocation rule and to obtain the associated confusion matrix. Open the file 
banknotes3.sav. 

Obtain a crosstabulation of actual and allocated groups, as follows. 

© Obtain the Crosstabs dialogue box (Analyze > Descriptive Statistics > Crosstabs...).
© Enter type in the Row(s) field and allocated in the Column(s) field. 


© Click on the Cells... button. The Crosstabs: Cell Display dialogue box 
will open. 


© Select Row in the Percentages area, and make sure that Observed is 
checked in the Counts area. 


© Click on Continue, then click on OK. 


The following crosstabulation will be displayed in the Viewer window. 


type * allocated Crosstabulation

                                        allocated
                                   Genuine   Counterfeit    Total
type    Genuine       Count            99          1         100
                      % within type   99.0%       1.0%     100.0%
        Counterfeit   Count             0        100         100
                      % within type    0.0%     100.0%     100.0%
Total                 Count            99        101         200
                      % within type   49.5%      50.5%     100.0%





The entries in the Count rows of the table correspond to the numbers of notes in
each of the categories. Thus 1 genuine note is classified as counterfeit, and
0 counterfeit notes are classified as genuine. Therefore the misclassification rate is
(1 + 0)/200 × 100% = 0.5%. The % within type rows give the percentages of
genuine notes classified as genuine and as counterfeit, and the corresponding
percentages for counterfeit notes. Thus the confusion matrix can be written down
directly using this table.
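
The arithmetic behind the misclassification rate and the confusion matrix can be checked with a few lines of Python. This is a minimal sketch, not an SPSS feature: it starts from the counts stated above (99 and 1 for genuine notes, 0 and 100 for counterfeit notes), and the variable names are chosen here for illustration.

```python
# Counts from the crosstabulation.
# Rows: actual group; columns: allocated group (order: genuine, counterfeit).
counts = [[99, 1],
          [0, 100]]

total = sum(sum(row) for row in counts)
correct = sum(counts[i][i] for i in range(len(counts)))

# Overall misclassification rate, as a percentage of all notes.
misclassification_rate = 100 * (total - correct) / total

# Confusion matrix: each count as a percentage of its row (actual group) total.
confusion = [[100 * n / sum(row) for n in row] for row in counts]

print(misclassification_rate)
print(confusion)
```

The off-diagonal entries of the confusion matrix are the percentages of each actual group that the rule allocates to the wrong group.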


Computer Activity 6.3 Confusion matrix for the satellite data 


In Computer Activity 5.5, you obtained the discriminant functions for the satellite 
data, extracted and saved values of the discriminant functions, and obtained 
stacked histograms of the values for the first two discriminant functions. The data 
and the values of the four discriminant functions are in the file satellite2.sav. 
Open this file now. In this activity, you will construct and evaluate an allocation 
rule based on the first discriminant function, disc1.


(a) Obtain a table giving the mean of disc1 for each of the six groups. Hence
verify that the following allocation rule is appropriate:


if d ≤ −3.6730 classify as cotton crop,
if −3.6730 < d ≤ −0.4135 classify as stubble,
if −0.4135 < d ≤ 0.5466 classify as red soil,
if 0.5466 < d ≤ 1.2696 classify as very damp grey soil,
if 1.2696 < d ≤ 1.6454 classify as damp grey soil,
otherwise classify as grey soil.






(b) Create a new variable named allocated containing the soil types allocated
by the rule given in part (a). When assigning value labels, use the following
coding:

1: red soil; 4: damp grey soil;
2: cotton crop; 5: stubble;
3: grey soil; 6: very damp grey soil.


Obtain a frequency table for allocated. 


(c) Obtain the confusion matrix for the allocation rule. Briefly describe the 
accuracy of this rule. 
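
With six groups, the allocation rule amounts to finding which interval between consecutive cutpoints contains the value d. The Python sketch below illustrates the idea using the five cutpoints quoted in part (a) and the group codes listed in part (b). It assumes that each interval includes its upper endpoint, and the test scores are hypothetical values, not observations from satellite2.sav.

```python
from bisect import bisect_left

# Cutpoints from part (a), in increasing order.
cutpoints = [-3.6730, -0.4135, 0.5466, 1.2696, 1.6454]

# Group codes for the six intervals, from lowest to highest values of d:
# cotton crop, stubble, red soil, very damp grey soil, damp grey soil, grey soil.
codes = [2, 5, 1, 6, 4, 3]


def allocate(d):
    """Return the group code for discriminant score d.

    bisect_left counts the cutpoints strictly below d, so a score equal to a
    cutpoint is placed in the lower interval (assumed convention).
    """
    return codes[bisect_left(cutpoints, d)]


# Hypothetical scores, for illustration only.
scores = [-4.0, -3.6730, -1.0, 0.0, 1.0, 1.5, 2.0]
print([allocate(d) for d in scores])
```

In SPSS the same result is obtained by entering one range per group in the Recode into Different Variables dialogue box.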


In Computer Activity 6.2, you learned how to obtain the confusion matrix for a 
training set. Sometimes the observations from a test set need to be classified. The 
method is illustrated in Computer Activity 6.4. 


Computer Activity 6.4 Confusion matrix for a test set 


In Computer Activity 5.3, you obtained the first discriminant function for the 
satellite data. This discriminant function, expressed in terms of unstandardized 
variables, is as follows: 


disc1 = 0.045 bandA + 0.076 bandB − 0.038 bandC − 0.049 bandD − 1.743.


This discriminant function was calculated from 4435 observations. A further set
of 2000 observations is in the data file testsat.sav. These observations may be
used to check the accuracy of the allocation rule derived using disc1. Open the
file testsat.sav.


(a) Create a new variable disc1 containing the values of the first discriminant
function, evaluated on the 2000 observations of the test set. Hint: use the
Compute Variable dialogue box.

The use of the Compute Variable dialogue box is described in the Introduction to
statistical modelling.

(b) Create a variable named allocated by recoding disc1 according to the
allocation rule given in part (a) of Computer Activity 6.3; use the value
labels given there. Obtain the frequency table for allocated.


(c) Obtain the confusion matrix for the test set. Briefly compare it with the 
confusion matrix for the training set that you obtained in part (c) of 
Computer Activity 6.3. 
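
Evaluating a discriminant function on new observations is simply a weighted sum of the variables plus a constant. The following Python sketch shows the calculation that the Compute Variable dialogue box performs for the first discriminant function quoted above. The two sets of band readings are hypothetical values invented for this illustration; they are not rows of testsat.sav.

```python
# Coefficients and constant of the first discriminant function (unstandardized variables).
coefficients = {"bandA": 0.045, "bandB": 0.076, "bandC": -0.038, "bandD": -0.049}
constant = -1.743


def disc1(pixel):
    """Evaluate the first discriminant function for one pixel (a dict of band readings)."""
    return sum(coefficients[band] * pixel[band] for band in coefficients) + constant


# Hypothetical band readings, for illustration only.
pixels = [
    {"bandA": 92, "bandB": 115, "bandC": 120, "bandD": 94},
    {"bandA": 46, "bandB": 36, "bandC": 134, "bandD": 139},
]

scores = [disc1(p) for p in pixels]
print([round(s, 3) for s in scores])
```

Each score can then be allocated to a group by comparing it with the cutpoints obtained from the training set.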


Summary of Chapter 6 


In this chapter, you have learned how to use SPSS to construct and evaluate an 
allocation rule based on the first discriminant function. You have learned how to 
calculate misclassification rates and obtain the confusion matrix. You have also 
learned how to apply an allocation rule to a test set. 




Computer Exercises on Book 3 


Computer Exercise 1 Heavy metals in Alpine tree rings 


In a study of the impact of atmospheric pollution on the natural ecosystem of the 
Western Italian Alps, measurements were taken on the concentration of five heavy 
metals in Larix decidua trees from this region. For each tree, measurements were 
made at seven different tree ring depths, corresponding to the seven decades
1930-39, 1940-49, ..., 1990-99. The concentrations of cadmium (Cd),
chromium (Cr), copper (Cu), lead (Pb) and nickel (Ni) were obtained (in parts
per million of metal per dry weight of wood). Measurements from 20 trees were
obtained.


The data are in the file treerings.sav. There are nine variables: sample identifies 
the tree, metal indicates the heavy metal, and the seven variables log90s,
log80s, ..., log30s are the log concentrations for the seven decades. 


(a) (i) Obtain a scatterplot of log90s against log30s, showing the different
metals as groups in the data. 


(ii) There are five measurements for cadmium that do not fit the overall 
pattern. Identify the trees corresponding to these points, by labelling 
them with the sample variable. 


(iii) Comment briefly on the relationships between the two variables for the 
five heavy metals. Do the measurements differ between types of metal? 


(b) (i) Obtain a matrix scatterplot of the seven variables log90s, ..., log30s. 


(ii) In the panel corresponding to log90s and log30s, label the five outliers
you identified in part (a)(ii). Do these samples necessarily correspond 
to unusual points in the other panels of the matrix scatterplot? 


(c) Obtain the mean vector of the log concentrations. In which decade was the 
mean concentration highest? Describe the change in mean log concentrations 
over time. 


(d) Obtain the correlation matrix for the log concentrations. Identify the pair of 
variables for which the correlation is lowest and the pair for which it is 
highest (in absolute value). Comment in general terms on the relationship 
between log concentrations in different decades. 


Computer Exercise 2 The cost of crime 


The data set in the file crimeloss2.sav was discussed in Computer Activities 1.4 
and 2.4. Open this file and undertake a principal component analysis of the four 
variables proploss, persloss, stateexp and localexp, as follows. 


(a) Obtain the means and standard deviations of the four variables, and discuss 
whether to base the PCA on standardized or unstandardized data. 


(b) Perform a principal component analysis based on standardized data, and 
extract all four principal components. Write down their variances. 


(c) Obtain a scree plot. Does it help in identifying how many principal 
components should be retained? Explain your answer. 


(d) Write down the percentage variance explained by each of the first two 
principal components, and the cumulative percentage variance explained by 
the first two principal components. 


(e) Briefly interpret the first two principal components. 


(f) Extract the principal components, and obtain a scatterplot of the first two 
components. Identify any unusual points on the scatterplot by labelling them 
with the variable state. Briefly summarize your results. 




Computer Exercise 3 Shades of grey 


The data file greysat.sav contains the satellite data relating to pixels identified
as grey soil, damp grey soil and very damp grey soil which are described in
Example 10.7 of Book 3. These data are a subset of the larger data set described
in Computer Activity 5.3. The data file greytest.sav contains a test set of
1078 observations.


(a) Open the data file greysat.sav and perform a canonical discriminant
analysis of the four variables bandA, bandB, bandC and bandD, with group as
grouping variable (coded 1 for grey soil, 2 for damp grey soil, and 3 for very
damp grey soil). Discuss briefly whether more than one discriminant function
is needed to separate the groups.


(b) Write down the first discriminant function in terms of the unstandardized
variables, and extract and save values of this discriminant function.


(c) Obtain stacked histograms of the values of the first discriminant function.
Briefly discuss how well the groups are separated.


(d) Calculate the cutpoints for an allocation rule based on the first discriminant
function. (Use four decimal places for the cutpoints.)


(e) Obtain the confusion matrix for this allocation rule based on the training set.
Briefly describe how accurate the allocation rule is.


(f) Now open the data file greytest.sav. Use the expression you obtained in
part (b) and the allocation rule you derived in part (d) to obtain the
confusion matrix for this test set. Calculate the overall misclassification rate
(that is, the percentage of misclassified pixels in the data set as a whole) for
the test set, and also for the training set. What do you conclude?




Learning outcomes 


You have been working to acquire the following skills in using SPSS. 


© Represent groups on a scatterplot using different colours or different plotting 
symbols. 


© Produce a matrix scatterplot.

© Label points on scatterplots and on matrix scatterplots.

© Transpose data and produce a profile plot.

© Obtain mean vectors, correlation matrices and covariance matrices.

© Obtain mean vectors for subgroups of data.

© Save standardized variables.

© Carry out a principal component analysis with standardized data and with
unstandardized data.

© Obtain a scree plot. 


© Specify the number of principal components to extract, and save the 
components. 


© Carry out a canonical discriminant analysis. 


© Obtain discriminant functions in terms of group-standardized variables and 
in terms of unstandardized variables. 


© Obtain the separation and percentage separation achieved by a discriminant 
function. 


© Extract and save values of a discriminant function, and produce stacked 
histograms of the values for the groups. 


© Construct and evaluate an allocation rule based on the first discriminant 
function. 


© Calculate misclassification rates and obtain the confusion matrix for the 
training set and for a test set. 




Solutions to Computer Activities 


Solution 1.2 

(a) The scatterplot, with regions represented by different black symbols, and
with the two states with largest and smallest expenditure on policing
labelled, is shown in Figure S.1.

The method is described in Computer Activity 1.1.














[Figure: scatterplot with totloss on the horizontal axis; plotting symbols distinguish the regions Mid Atlantic, Midwest, Northeast, South, Southwest and West.]

Figure S.1 Scatterplot of cost of crime and expenditure on policing


(b) Overall, there is a roughly linear relationship between expenditure on 
policing and the cost of crime. Similar relationships apply for states in both 
the South and West regions. However, for a fixed crime loss, the expenditure 
on policing in the West region is higher than in the South region. 




Solution 1.4 


(a) The matrix scatterplot is shown in Figure S.2. 


[Figure: matrix scatterplot of proploss, persloss, stateexp and localexp, with New York labelled.]

Figure S.2 Matrix scatterplot of the crime and policing data


(b) The two crime loss variables are positively related. The two expenditure
variables do not appear to be strongly related.


(c) The unusual point corresponds to the state with the highest per capita local 
expenditure on policing and average per capita property loss. This state, 
New York, is labelled in Figure S.2.


(d) The state of New York is unusual in other respects. For example, it has 
rather low state expenditure on policing (given its high local expenditure) 
and rather high personal loss from crime (given its average property loss). 


Solution 1.6 


(a) The observations in this data set are the eight settings. 


(b) The profile plot is shown in Figure S.3. 


[Figure: profile plot of normalized response against sensor, with one line for each of the eight settings.]

Figure S.3 Profile plot of Iberian hams data


The method is described in 
Computer Activity 1.3. 


The method is described in 
Computer Activity 1.5. 




In this plot the line styles have been altered to help identify the lines, the 
axis labels have been changed, and the labels in the legend have been 
replaced with more explicit descriptions. The sensors have not been 
re-ordered along the x-axis. 


The settings 50°C, 20 minutes, 5 g and 50°C, 40 minutes, 5 g seem to produce
the best responses overall: the two lines corresponding to these settings are 
generally above the other six. The responses for these two settings are the 
highest except at sensors 1, 5 and 6. 


Solution 2.4 


(a) The mean vector can be obtained using the Descriptives dialogue box. The
Descriptive Statistics output table is reproduced below.


Descriptive Statistics

                      N   Minimum   Maximum       Mean   Std. Deviation
proploss                              89.50    51.5960         14.97360
persloss                             676.29   379.6856        173.79300
stateexp                              83.87                    15.13987
localexp                             261.14                    42.12179
Valid N (listwise)



Thus the mean vector for (proploss, persloss, stateexp, localexp), with 
elements rounded to two decimal places, is (51.60, 379.69, 30.87, 134.54). 
Thus, on average, per capita property loss is much less than per capita
personal loss, and state expenditure on policing is much less than local 
expenditure. 


The correlations and covariances are found using the Bivariate 
Correlations dialogue box, as described in Computer Activity 2.3. The 
correlations and covariances are given in the Correlations output table. 


The largest covariance (in absolute value) between two distinct variables is 
3440.643, between persloss and localexp. The largest correlation (in 
absolute value) between two distinct variables is 0.693, between persloss 
and proploss. The two pairs of variables are different because the covariance 
reflects not just the correlation between the two variables, but also their 
variance, and the variance of localexp is much greater than that of 
proploss. 
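
The link between covariance and correlation can be made explicit: the covariance of two variables is their correlation multiplied by both standard deviations. The Python sketch below checks this using the correlations and standard deviations quoted in this solution. Because the correlations are rounded to three decimal places, the results only approximate the SPSS values, and the covariance computed for proploss and persloss is derived here rather than read from the output.

```python
# Standard deviations from the Descriptive Statistics table.
sd = {"proploss": 14.97360, "persloss": 173.79300, "localexp": 42.12179}

# Correlations (rounded to three decimal places) from the correlation matrix.
r_persloss_localexp = 0.470
r_proploss_persloss = 0.693

# covariance = correlation x standard deviation x standard deviation
cov_persloss_localexp = r_persloss_localexp * sd["persloss"] * sd["localexp"]
cov_proploss_persloss = r_proploss_persloss * sd["proploss"] * sd["persloss"]

print(round(cov_persloss_localexp, 1))
print(round(cov_proploss_persloss, 1))
```

Although proploss and persloss have the larger correlation, the small standard deviation of proploss gives that pair the smaller covariance.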


The lower triangle of the correlation matrix (with variables in the order 
proploss, persloss, stateexp, localexp) is as follows. 


 0.693
−0.008  0.369
 0.426  0.470  −0.015


There is little evidence of an association between state expenditure on
policing and property loss (correlation −0.008), or between state and local
expenditure on policing (correlation −0.015). The other pairs of variables
appear to be moderately positively associated, the strongest relationship
being between property loss and personal loss (correlation 0.693).


The method is described in Computer Activity 2.1.




Solution 3.2 


(a) Using the Descriptives dialogue box to obtain means and standard
deviations is described in Computer Activity 2.1. The variable with the
largest standard deviation is logSr, with standard deviation 2.254. The
variable with the smallest standard deviation is logMo, with standard
deviation 0.608. Thus the variance of logSr is (2.254/0.608)² ≈ 13.7 times
that of logMo. If each trace element is to be given equal weight in assessing
the water composition, then it is sensible to standardize the variables prior to
undertaking a PCA, to avoid the results being unduly influenced by the
variables with large variance.


The Total Variance Explained output table shows that the first five 
components are retained using Kaiser's criterion. 


From the Total Variance Explained output table, the PVE by the first 
component is 41.519%. The PVE by the four subsequent components is
16.607%, 11.394%, 8.694% and 5.320%. The CPVE by the first two 
components is 58.126%, and the CPVE by the five retained components 
is 83.533%. 


The loadings are listed in the Component Matrix output table. The loadings
for logLi for the first two components, as calculated by SPSS, are
$\alpha^*_{11} = 0.359$ and $\alpha^*_{21} = 0.841$. From the Total Variance Explained output
table, the variances of the first two principal components are $V(Y_1) = 10.380$
and $V(Y_2) = 4.152$. Hence the required loadings are

\[ \alpha_{11} = \alpha^*_{11} / \sqrt{V(Y_1)} = 0.359/\sqrt{10.380} \approx 0.111, \]
\[ \alpha_{21} = \alpha^*_{21} / \sqrt{V(Y_2)} = 0.841/\sqrt{4.152} \approx 0.413. \]
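
The rescaling of the SPSS loadings is easy to verify numerically. The sketch below, in Python, divides each loading reported by SPSS by the square root of the variance of the corresponding principal component, using only the figures quoted in this solution; the variable names are chosen here for illustration.

```python
from math import sqrt

# Loadings for logLi as reported by SPSS, for components 1 and 2.
spss_loadings = [0.359, 0.841]

# Variances of the first two principal components.
variances = [10.380, 4.152]

# Required loading = SPSS loading / sqrt(variance of the component).
loadings = [a / sqrt(v) for a, v in zip(spss_loadings, variances)]
rounded = [round(x, 3) for x in loadings]
print(rounded)
```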


Solution 3.4 


(a) The use of SPSS for principal component analysis based on standardized data
is described in Computer Activity 3.1.


(i) SPSS retains two components, with CPVE = 73.227%.


(ii) The loadings obtained by SPSS are given in the Component Matrix
output table. The first component has broadly similar loadings for all
nine scores, so it represents overall ability in mathematics. The second
component has positive loadings for scores A, B, C and D (geometrical
ability) and negative loadings for scores E, F, G, H and I (arithmetical
and algebraical ability). Thus the second component represents a
contrast between geometrical ability and other types of mathematical
ability.


(b) PCA based on unstandardized data is described in Computer Activity 3.3.


(i) SPSS has retained three components, with CPVE = 89.637%. Thus one 
more component is retained than in the analysis based on standardized 
data, and the CPVE is higher. 


(ii) The loadings obtained by SPSS are given in the Component Matrix 
output table. The loadings for the first component are greatest for 
scores C, F and I. This is not surprising, since these are the three 
variables with the largest variances, as shown in the Communalities 
output table. 


(iii) All the loadings for the first component are positive. This component 
can therefore be interpreted as representing overall mathematical ability. 
The second component has negative loadings for scores E, F and G, 
which are associated with arithmetical ability. Thus this component 
contrasts arithmetical ability with other forms of mathematical ability. 
The third component has negative loadings for scores H and I, which 
are associated with algebraical ability. Thus this component contrasts 
algebraical ability with other forms of mathematical ability. 




Carrying out a principal 
component analysis using 
standardized data is described 
in Computer Activity 3.1. 


The CPVE is given in the Total 
Variance Explained table. 




(c) From the Initial column of the Raw part of the Communalities output 
table it is clear that the variances of the scores differ greatly; for instance, 
score I has variance 711.311, whereas score B has variance 62.397. This might 
suggest that the analysis based on the standardized data is preferable. 


Solution 4.4 


(a) The scatterplot is obtained using the procedure described in Computer 
Activity 1.1. To distinguish the different forms, enter form in the Set 
Markers by field of the Simple Scatterplot dialogue box. A scatterplot 
using different black plotting symbols for the different forms is shown in 
Figure S.4.





[Figure: scatterplot of the first two principal components, with plotting symbols distinguishing the forms.]

Figure S.4 Scatterplot of the first two principal components


(b) (i) The values of the first principal component differ according to the forms
the boys are in. The values of pc1 are generally highest for the boys in
form VII and lowest in form UVb. This reflects the progression of the
boys as they move up through the school.


(ii) There do not seem to be any differences in the values of the second
principal component between forms. For each form, the points have
roughly the same distribution vertically.

(c) The value of the first principal component is partly determined by a boy's 
position in the school. This makes sense given that the first component can 
be interpreted as measuring general ability in mathematics. Reassuringly, 
mathematical ability seems to improve as the boys progress through the 
school. 


The fact that the value of the second principal component does not appear to
vary between forms also makes sense. The second principal component
roughly contrasts geometrical ability and other types of mathematical ability 
for the boys. It is not surprising that this does not change as boys progress 
through the school. 




Solution 4.5 


(a) The scree plot is shown in Figure S.5.

Obtaining a scree plot is described in Computer Activity 4.1.


[Figure: scree plot of eigenvalue against component number.]

Figure S.5 Scree plot of the water samples data


The elbow in this plot is not obvious, but perhaps the elbow occurs at
component number 6, in which case five components should be retained. In
Computer Activity 3.2, five components were retained using Kaiser's
criterion.

(b) If you decided to retain k components in part (a), then enter k in the
Factors to extract field of the Factor Analysis: Extraction dialogue box.
This solution uses k = 5. The variable names were changed to pc1, ..., pc5.

Saving principal components is described in Computer Activity 4.3.


The CPVE by the five retained components is 83.533%, or 83.5% to one
decimal place. 


(c) The matrix scatterplot of the five components is shown in Figure S.6.

[Figure: matrix scatterplot of pc1, pc2, pc3, pc4 and pc5.]

Figure S.6 Matrix scatterplot of the five principal components




The scatterplot indicates the presence of some unusual points separated from 
the main clump. There are also some groups of samples, for example on the 
panel for pc1 and pc5. 


(d) A scatterplot of the first two principal components is shown in Figure S.7. 





[Figure: scatterplot of the first two principal components, with pc1 on the horizontal axis and one sample labelled in each group.]

Figure S.7 Scatterplot of the first two principal components, with one point in
each group of samples identified


The samples appear to fall into distinct groups. Seven distinct groups have
been identified on Figure S.7. You may well have chosen slightly different
groups. For example, the three samples in the top right-hand corner have 
been designated as a single group, but could arguably be described as two 
separate groups (one group with a single point, and one with two points). 


Solution 5.3 


(a) Obtaining the discriminant functions in terms of the group-standardized
variables is described in Computer Activity 5.1. Since there are six groups,
enter the values 1 and 6 in the Define Range dialogue box. The number of
groups G is 6, and the number of variables p is 4. The number of
discriminant functions is the minimum of G − 1 and p, so there are four
discriminant functions. The separations achieved by the discriminant
functions are given in the Eigenvalue column of the Eigenvalues output
table, which is reproduced below.


Eigenvalues

Function   Eigenvalue   % of Variance   Cumulative %   Canonical Correlation
1             5.900
2             4.071
3             1.523
4             0.016

a. First 4 canonical discriminant functions were used in the analysis.


The separations achieved are 5.900 for the first discriminant function, 4.071 
for the second, 1.523 for the third and 0.016 for the fourth. 




(b) The percentage separation achieved by the first discriminant function is

$$\mathrm{PSA}_1 = \frac{5.900}{5.900 + 4.071 + 1.523 + 0.016} \times 100\% \approx 51.3\%.$$

The cumulative percentage separation achieved by the first two discriminant
functions is

$$\mathrm{CPSA}_2 = \frac{5.900 + 4.071}{5.900 + 4.071 + 1.523 + 0.016} \times 100\% \approx 86.6\%.$$





The third discriminant function contributes relatively little to separating the 
groups, and the fourth hardly contributes at all. These values are shown in 
the % of Variance and Cumulative % columns of the Eigenvalues output
table. 
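The percentage separations can be checked outside SPSS. The following is a minimal Python sketch, not part of the module software; the function name `percentage_separations` is an illustrative choice, and the only inputs are the four eigenvalues quoted in part (a).

```python
# Separations (eigenvalues) of the four discriminant functions, from part (a).
separations = [5.900, 4.071, 1.523, 0.016]


def percentage_separations(values):
    """Return two lists: the percentage separation achieved by each
    function (PSA) and the cumulative percentage separation (CPSA)."""
    total = sum(values)
    psa = [100 * v / total for v in values]
    cpsa = []
    running = 0.0
    for p in psa:
        running += p
        cpsa.append(running)
    return psa, cpsa


psa, cpsa = percentage_separations(separations)
for k, (p, c) in enumerate(zip(psa, cpsa), start=1):
    print(f"Function {k}: PSA = {p:.1f}%, CPSA = {c:.1f}%")
```

Rounded to one decimal place, the first PSA and the second CPSA should agree with the values calculated above.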


The loadings in terms of the group-standardized variables are given in the 
following output table. 


Standardized Canonical Discriminant Function 
Coefficients 





The loadings for the first discriminant function are positive for bands A
and B, and negative for bands C and D. Hence this discriminant function 
contrasts the readings obtained in bands A and B with those obtained in 
bands C and D. The loadings for the second discriminant function are close 
to zero for bands C and D. This discriminant function contrasts readings in 
band A with readings in band B. 


The loadings in terms of the unstandardized variables are obtained as
described in Computer Activity 5.2. They are given in the Canonical 
Discriminant Function Coefficients output table. 


In terms of the unstandardized variables, the first two discriminant functions
are

$$D_1 = 0.045\,\mathrm{bandA} + 0.076\,\mathrm{bandB} - 0.038\,\mathrm{bandC} - 0.049\,\mathrm{bandD} - 1.743,$$
$$D_2 = 0.268\,\mathrm{bandA} - 0.156\,\mathrm{bandB} - 0.001\,\mathrm{bandC} - 0.001\,\mathrm{bandD} - 5.426.$$
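To see how these two functions turn the four band readings for a pixel into a pair of scores, the following Python sketch may help. The coefficients are those quoted above; the pixel readings are hypothetical values chosen only for illustration, and the function name `discriminant_value` is not part of SPSS.

```python
# Coefficients for (bandA, bandB, bandC, bandD) and the constant term,
# taken from the unstandardized discriminant functions in the solution.
D1 = ((0.045, 0.076, -0.038, -0.049), -1.743)
D2 = ((0.268, -0.156, -0.001, -0.001), -5.426)


def discriminant_value(function, bands):
    """Evaluate a linear discriminant function at one set of band readings."""
    coefficients, constant = function
    return sum(c * x for c, x in zip(coefficients, bands)) + constant


# Hypothetical readings for a single pixel (not taken from the data file).
pixel = (80, 90, 100, 80)

print("disc1 =", round(discriminant_value(D1, pixel), 3))
print("disc2 =", round(discriminant_value(D2, pixel), 3))
```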


Solution 5.5 


(a) Saving the values of discriminant functions is described in Computer
Activity 5.4. Editing the content of the Variable View panel of the Data
Editor is also described in that activity.

(b) Histograms for the first discriminant function are shown in Figure S.8.

The main separation achieved by the first discriminant function is between
pixels corresponding to cotton crop and the rest. There is also some
separation between red soil and stubble, on the one hand, and the three types
of grey soil, on the other.

Histograms for the second discriminant function are shown in Figure S.9.

This discriminant function primarily separates red soil from all other types of
soil.




Obtaining stacked histograms is 
described in Computer 
Activity 5.4. 



















[Stacked histograms of disc1, one panel for each group; vertical axis: Frequency]


Figure S.8 Histograms of values of the first discriminant function 























[Stacked histograms of disc2, one panel for each group; vertical axis: Frequency]


Figure S.9 Histograms of values of the second discriminant function 




(c) The scatterplot with different plotting symbols for the different groups is
shown in Figure S.10.

















[Scatterplot of the first two discriminant functions, with disc1 on the horizontal axis and a different plotting symbol for each group: red soil, cotton crop, grey soil, damp grey soil, stubble, very damp grey soil]


Figure S.10  Scatterplot of the first two discriminant functions 


You may find that using different colours makes the picture clearer. Different
black symbols have been used in Figure S.10 for printing purposes. It is
possible to distinguish four subgroups in this scatterplot. The cotton crop
pixels are on the left-hand side, and the red soil pixels are in the bottom
right-hand corner. On the right-hand side near the top are the three grey soil
groups, which are rather difficult to tell apart. Finally, the pixels
corresponding to stubble lie in the centre of the scatterplot.


Solution 6.3 


(a) The mean vector can be obtained using Means... from the Compare
Means submenu of Analyze, as described in Computer Activity 2.1. The
means of the six groups, in increasing order, are as follows: cotton crop,
−6.5265; stubble, −0.8196; red soil, −0.0075; very damp grey soil, 1.1007;
damp grey soil, 1.4385; grey soil, 1.8522. The allocation rule is based on
cutpoints at the midpoints between the group means. So, for example, the
cutpoint between cotton crop and stubble is

$$l_1 = (-6.5265 - 0.8196)/2 \approx -3.6730.$$


The other cutpoints are calculated in a similar way, leading to the allocation 
rule given. 
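The midpoint rule can be sketched in a few lines of Python. This is an illustration only, assuming the six group means quoted above; the helper name `allocate` is hypothetical, and a value lying exactly on a cutpoint is assigned to the lower group, which is one possible convention.

```python
from bisect import bisect_left

# Group means of the first discriminant function, from part (a).
group_means = {
    "cotton crop": -6.5265,
    "stubble": -0.8196,
    "red soil": -0.0075,
    "very damp grey soil": 1.1007,
    "damp grey soil": 1.4385,
    "grey soil": 1.8522,
}

# Sort the groups by mean, then take midpoints of adjacent means.
ordered = sorted(group_means.items(), key=lambda item: item[1])
names = [name for name, _ in ordered]
means = [value for _, value in ordered]
cutpoints = [(a + b) / 2 for a, b in zip(means, means[1:])]


def allocate(d1):
    """Allocate a value of the discriminant function to a group.

    bisect_left counts how many cutpoints lie strictly below d1, which
    is the position of the allocated group in the sorted list.
    """
    return names[bisect_left(cutpoints, d1)]


for cut in cutpoints:
    print(f"{cut:.4f}")
print(allocate(-5.0), "|", allocate(0.0), "|", allocate(2.0))
```

The five cutpoints obtained this way should agree, to within rounding in the fourth decimal place, with those used in the allocation rule.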




Using different plotting symbols 
for different groups is described 
in Computer Activity 1.1. 


The means are also given in the 
Functions at Group 
Centroids table in the output 
of a canonical discriminant 
analysis. 




(b) The variable allocated is obtained as described in Computer Activity 6.1. 
The coding is as shown in Figure S.11. 








Old --> New:

Lowest thru -3.6730 --> 2
-3.6729 thru -0.4135 --> 5
-0.4134 thru 0.5466 --> 1
0.5467 thru 1.2696 --> 6
1.2697 thru 1.6454 --> 4
1.6455 thru Highest --> 3














Figure S.11 Completed dialogue box for the recoding

The data and the variable allocated have been saved in the file
satellite3.sav.


The frequency table is as follows. 


allocated

Frequency | Percent | Valid Percent | Cumulative Percent
red soil
cotton crop
grey soil
damp grey soil
stubble
very damp grey soil
Total







(c) The confusion matrix can be obtained from a crosstabulation of group and 
allocated, as described in Computer Activity 6.2. The crosstabulation is 
shown below. 


group * allocated Crosstabulation

Rows: group. Columns: allocated.

group | | red soil | cotton crop | grey soil | damp grey soil | stubble | very damp grey soil | Total
red soil | Count | 568 | 0 | 6 | 10 | 311 | 177 | 1072
cotton crop | Count | 20 | 407 | 0 | 1 | 48 | 3 | 479
grey soil | Count | 4 | 0 | 718 | 196 | 0 | 43 | 961
damp grey soil | Count | 14 | 0 | 146 | 145 | 0 | 110 | 415
stubble | Count | 127 | 1 | 1 | 6 | 306 | 29 | 470
very damp grey soil | Count | 68 | 0 | 50 | 286 | 2 | 632 | 1038
Total | % within group | 18.1% | 9.2% | 20.8% | 14.5% | 15.0% | 22.4% | 100.0%


The pixels corresponding to cotton crop and grey soil are relatively accurately 
classified: 85.0% and 74.7% of these, respectively, are correctly classified. 
Damp grey soil pixels are inaccurately classified: only 34.9% are correctly 
classified, and the majority are misclassified to other types of grey soil. The 
pixels corresponding to red soil and stubble are not very well classified, and 
tend to be mutually misclassified (red soil as stubble, and vice versa). 
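The percentages quoted here are the diagonal counts of the crosstabulation divided by the row totals. A short Python sketch of that calculation, using the counts from the crosstabulation, is given below; the function name `percent_correct` is an illustrative choice rather than an SPSS feature.

```python
groups = ["red soil", "cotton crop", "grey soil",
          "damp grey soil", "stubble", "very damp grey soil"]

# Counts from the crosstabulation: rows are actual groups, columns are
# allocated groups, both in the order given in `groups`.
counts = [
    [568, 0, 6, 10, 311, 177],
    [20, 407, 0, 1, 48, 3],
    [4, 0, 718, 196, 0, 43],
    [14, 0, 146, 145, 0, 110],
    [127, 1, 1, 6, 306, 29],
    [68, 0, 50, 286, 2, 632],
]


def percent_correct(matrix):
    """Percentage of each actual group that is allocated to itself."""
    return [100 * row[i] / sum(row) for i, row in enumerate(matrix)]


rates = percent_correct(counts)
for name, rate in zip(groups, rates):
    print(f"{name}: {rate:.1f}% correctly classified")
```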





Solution 6.4 


(a) The variable disc1 can be calculated using the Compute Variable dialogue
box. Type disc1 in the Target Variable field, and enter the following
expression in the Numeric Expression field:


0.045*bandA+0.076*bandB-0.038*bandC-0.049*bandD-1.743

(b) The method is described in Computer Activity 6.3. The frequency table is as
follows.


allocated

Frequency | Percent | Valid Percent | Cumulative Percent
red soil
cotton crop
grey soil
damp grey soil
stubble
very damp grey soil
Total





The variables disc1 and allocated have been saved in the file testsat2.sav.




(c) The confusion matrix may be obtained from the crosstabulation of group and 
allocated, which is shown below. 


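group * allocated Crosstabulation

Rows: group. Columns: allocated.

group | | red soil | cotton crop | grey soil | damp grey soil | stubble | very damp grey soil | Total
red soil | Count | 260 | 0 | 1 | 2 | 131 | 67 | 461
cotton crop | Count | 6 | 198 | 0 | 0 | 18 | 2 | 224
grey soil | Count | 2 | 0 | 236 | 114 | 0 | 45 | 397
damp grey soil | Count | 5 | 0 | 44 | 90 | 1 | 71 | 211
stubble | Count | 61 | 0 | 1 | 3 | 154 | 18 | 237
very damp grey soil | Count | 38 | 0 | 14 | 91 | 4 | 323 | 470
Total | Count | 372 | 198 | 296 | 300 | 308 | 526 | 2000
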
The table is broadly similar in many respects to that obtained in Computer
Activity 6.3. The major difference relates to the proportion of grey soil pixels
that are correctly classified. This was 74.7% in the training set, but is only
59.4% in the test set.







Solutions to Computer Exercises 


Solution 1 


This exercise covers some of the ideas and techniques discussed in Chapters 1
and 2. 


(a) (i) Obtaining a scatterplot and displaying subgroups is described in 
Computer Activity 1.1. In the Simple Scatterplot dialogue box, enter 
metal in the Set Markers by field and (for later use) enter sample in 
the Label Cases by field. A scatterplot, edited so as to use different 
black plotting symbols for the different groups, and with outliers 
labelled, is shown in Figure S.12.


[Scatterplot of log90s against log30s, with a different plotting symbol for each metal: Cd, Cr, Cu, Ni, Pb]


Figure S.12  Scatterplot of log concentrations in the 1990s and 1930s


(ii) The five unusual points are labelled in Figure S.12. They correspond to
the trees md, mm, p4b, s2a, s2b. 

(iii) In general, high log concentrations in the 1930s are paired with high log 
concentrations in the 1990s. There are some differences between metals; 
for example, the log concentrations of copper tend to be higher than 
those for cadmium. 




Labelling points is described in 
Computer Activity 1.2. 




(b) (i) The matrix scatterplot, with one of the five unusual points identified in
part (a)(ii) labelled, is shown in Figure S.13.

















[Matrix scatterplot; panels: log90s, log80s, log70s, log60s, log50s, log40s, log30s; sample p4b labelled]


Figure S.13 Matrix scatterplot of log concentrations


(ii) The five unusual points identified in part (a)(ii) are not unusual in all
the other panels. For example, from Figure S.13, sample p4b is not
unusual in any of the panels not involving the variable log90s.


(c) The highest mean log concentration is approximately 1.69, for the 1990s. The
mean log concentrations increase over the decades, with a small dip in the
1940s.


(d) The smallest correlation, 0.509, is between the 1950s and 1990s, and the
largest, 0.864, is between the 1970s and 1980s. The correlations are all
positive, indicating positive relationships between log concentrations of heavy
metals in different decades.


Solution 2 


This exercise covers some of the ideas and techniques discussed in Chapters 3
and 4. 


(a) Means and standard deviations can be found using the Descriptives 
dialogue box. The Descriptive Statistics output table is reproduced 
below. 


Descriptive Statistics

N | Minimum | Maximum | Mean | Std. Deviation
proploss
persloss
stateexp
localexp
Valid N (listwise)


There is a large variation in the standard deviations of the four variables. 
Hence to avoid the PCA being dominated by the variable with the largest 
standard deviation (persloss), it is advisable to base the analysis on 
standardized variables. 
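The effect of standardizing can be illustrated with a small Python sketch. The numbers below are hypothetical and are not taken from the crime loss data; the function name `standardize` is illustrative, and the sample standard deviation (divisor n − 1) is assumed.

```python
from statistics import mean, stdev


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical variables measured on very different scales.
small_scale = [1.2, 0.8, 1.5, 1.1, 0.9]
large_scale = [120.0, 340.0, 90.0, 410.0, 250.0]

for label, data in (("small", small_scale), ("large", large_scale)):
    z = standardize(data)
    print(label, "before: sd =", round(stdev(data), 3),
          "| after: sd =", round(stdev(z), 3))
```

After standardizing, both variables have standard deviation 1, so neither can dominate a principal component analysis merely because of its units.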







Obtaining a matrix scatterplot, 
and labelling a point on it, is 
described in Computer 
Activity 1.3. 


Obtaining the mean vector is 
described in Computer 
Activity 2.1. 


Obtaining the correlation 
matrix is described in Computer 
Activity 2.3. 


See Computer Activity 2.1. 




(b) Performing PCA is described in Computer Activity 3.1. To extract all four 
components, select Fixed number of factors and enter 4 in the Factors to 
extract field in the Factor Analysis: Extraction dialogue box (see 
Computer Activity 3.2). The variances are in the Total column of the Total 
Variance Explained output table, which is shown below. 


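Total Variance Explained

Initial Eigenvalues (columns 2 to 4); Extraction Sums of Squared Loadings (columns 5 to 7)

Component | Total | % of Variance | Cumulative % | Total | % of Variance | Cumulative %
1 | 2.112 | 52.797 | 52.797 | 2.112 | 52.797 | 52.797
2 | 1.083 | 27.078 | 79.875 | 1.083 | 27.078 | 79.875
3 | 0.603 | 15.072 | 94.947 | 0.603 | 15.072 | 94.947
4 | 0.202 | 5.053 | 100.000 | 0.202 | 5.053 | 100.000

Extraction Method: Principal Component Analysis.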





The variances of the four components are, in order, 2.112, 1.083, 0.603 and
0.202. 


(c) The scree plot is shown in Figure S.14. 


[Scree plot: Eigenvalue against Component Number, for components 1 to 4]


Figure S.14 Scree plot for the crime loss data


The scree plot is not particularly helpful. There is an elbow at component 2, 
but it is not very marked. 


(d) The PVE and CPVE are given in the Total Variance Explained output
table which is reproduced in the solution to part (b). The PVE by the first
component is 52.797%, and that by the second component is 27.078%. The
CPVE by the first two components is 79.875%.
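As a check, these percentages can be recovered approximately from the variances quoted in part (b). The worked calculation below uses the variances rounded to three decimal places, so the results agree with the output table only to about one decimal place.

```latex
\[
\mathrm{PVE}_1 = \frac{2.112}{2.112 + 1.083 + 0.603 + 0.202} \times 100\%
             = \frac{2.112}{4.000} \times 100\% \approx 52.8\%,
\]
\[
\mathrm{CPVE}_2 = \frac{2.112 + 1.083}{4.000} \times 100\% \approx 79.9\%.
\]
```

The variances sum to 4 because the analysis is based on four standardized variables, each of which has variance 1.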


(e) The loadings are given in the Component Matrix output table, which is 
shown below. 


Component Matrix(a)

Component
proploss
persloss
stateexp
localexp

Extraction Method: Principal Component Analysis.
a. 4 components extracted.




Obtaining a scree plot is 
described in Computer 
Activity 4.1. 




The loadings for the first component are all positive and relatively large, so 
this component represents a measure of the overall cost of crime, including 
loss and expenditure on policing. The second component is dominated by the 
variable stateexp; the loading for stateexp is positive. The second largest 
loading in absolute value, which is for localexp, is negative. So this 
component could perhaps be interpreted as a contrast between state and 
local expenditure on policing. 


(f) Extracting and saving principal components is described in Computer 
Activity 4.3. The labelling of points on scatterplots is described in Computer 
Activity 1.1. A scatterplot of the first two principal components with some 
unusual points identified is shown in Figure S.15.





















































[Scatterplot of the first two principal components; horizontal axis: pc1; labelled points include Delaware, Alaska and South Carolina]


Figure S.15 Scatterplot of the first two principal components


Most points lie around a line with negative slope. But there are several states 
that do not fit this pattern. Hawaii has a lower value of the second principal 
component than might be expected, being negative with a large absolute 
value. Alaska, Delaware, Massachusetts, Pennsylvania and South Carolina 
have a larger value of the second component than might be expected. Large 
absolute values of the second principal component indicate a large difference 
between (standardized) state and local expenditure. 


Solution 3 


This exercise covers some of the ideas and techniques discussed in Chapters 5
and 6. 


(a) There are three groups and four variables, so the minimum of G − 1 and p
is 2, and hence there are two discriminant functions. The separations
achieved are given in the Eigenvalue column of the Eigenvalues output
table; they are 3.132 for the first discriminant function, and 0.004 for the
second discriminant function. Thus the first discriminant function accounts
for 99.9% of the achievable separation: the first discriminant function
therefore suffices.


Carrying out a canonical 
discriminant analysis is 
discussed in Computer 
Activity 5.1. 




(b) Obtaining the loadings in terms of the unstandardized variables is described 
in Computer Activity 5.2, and extracting and saving values of the 
discriminant functions is described in Computer Activity 5.4. The loadings 
are in the Canonical Discriminant Function Coefficients output table. 
The first discriminant function D_1 is

$$D_1 = 0.054\,\mathrm{bandA} + 0.059\,\mathrm{bandB} + 0.024\,\mathrm{bandC} + 0.023\,\mathrm{bandD} - 13.606.$$


Note that this is —1 times the discriminant function described in 
Activity 13.4 of Book 3. 


(c) The stacked histograms are shown in Figure S.16. See Computer Activity 5.4.





[Stacked histograms of the first discriminant function, one panel for each group: grey soil, damp grey soil, very damp grey soil; vertical axis: Frequency]


Figure S.16 Stacked histograms of the first discriminant function 


The separation achieved is clearly far from perfect, since the histograms 
overlap substantially. 


(d) The group means for the discriminant function are given in the Functions 
at Group Centroids output table, as described in Computer Activity 5.4. 
The means are as follows: 2.024 for grey soil, −0.019 for damp grey soil, and
−1.866 for very damp grey soil.


The cutpoints for the allocation rule are the midpoints of the intervals
determined by the group means, namely (to four decimal places)

$$l_1 = (-1.866 - 0.019)/2 = -0.9425 \quad\text{and}\quad l_2 = (-0.019 + 2.024)/2 = 1.0025.$$

(Note that when rounded to three decimal places, these are −1 times the
cutpoints obtained in Example 13.6 of Book 3.)

(e) The allocated groups are obtained using Recode into Different Variables 
as described in Computer Activity 5.4. The coding in the Old --> New field
of the Recode into Different Variables: Old and New Values dialogue 
box should be as follows. 


Lowest thru -0.9425 --> 3 
-0.9424 thru 1.0025 --> 2 
1.0026 thru Highest --> 1 




The confusion matrix can be obtained from a crosstabulation of group and 
allocated, as described in Computer Activity 6.2. The confusion matrix is 
as follows. 


Allocated soil type

Actual soil type | Grey soil | Damp grey soil | Very damp grey soil
Grey soil | 86.9% | 12.9% | 0.2%
Damp grey soil | 16.9% | 66.0% | 17.1%
Very damp grey soil | 0.7% | 17.8% | 81.5%


The allocation rule is reasonably accurate for grey soil and very damp grey
soil pixels: over 80% of these are correctly classified. However, it is not
accurate for pixels corresponding to damp grey soil: only 66% of these are
correctly classified.


The calculation of a confusion matrix for a test set is described in Computer
Activity 6.4 and its solution. The confusion matrix is as follows. 


Allocated soil type

Actual soil type | Grey soil | Damp grey soil | Very damp grey soil
Grey soil | 86.1% | 13.4% | 0.5%
Damp grey soil | 11.4% | 67.8% | 20.9%
Very damp grey soil | 0.9% | 19.1% | 80.0%


Overall, the percentage of the pixels correctly classified for the test set is

$$100 \times (342 + 143 + 376)/1078 \approx 79.9.$$

For the training set, the percentage of the pixels correctly classified is

$$100 \times (835 + 274 + 846)/2414 \approx 81.0.$$











Hence the overall misclassification rates are 20.1% for the test set and 19.0%
for the training set. As expected, the misclassification rate is higher for the 
test set than for the training set, but only marginally so. 
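The arithmetic behind the two misclassification rates can be sketched in Python as follows. The counts of correctly classified pixels and the totals are those quoted above; the function name `percent_correct` is an illustrative choice.

```python
def percent_correct(correct_counts, total):
    """Overall percentage of pixels allocated to their actual group."""
    return 100 * sum(correct_counts) / total


# Correctly classified pixels in each of the three groups, and the total
# number of pixels, as quoted in the solution.
test_correct = percent_correct([342, 143, 376], 1078)
training_correct = percent_correct([835, 274, 846], 2414)

# The misclassification rate is the complement of the percentage correct.
test_rate = 100 - test_correct
training_rate = 100 - training_correct

print(f"Test set: {test_correct:.1f}% correct, {test_rate:.1f}% misclassified")
print(f"Training set: {training_correct:.1f}% correct, "
      f"{training_rate:.1f}% misclassified")
```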




Index 


allocating observations to groups 31 
allocation 30 


Bivariate Correlations 14 


canonical discriminant analysis 25 
Compare Means 12 

confusion matrix 32 

Correlate 13 

correlation matrix 14 

Covariance matrix 19 

covariance matrix 14 

Crosstabs 32 

cumulative percentage variance explained 17 
customizing scatterplots 4 


Descriptives 12 

Dimension Reduction 16 
Discriminant Analysis 26-28 
discriminant analysis 25 

eigenvalue 21, 27 

extracting discriminant functions 28 
extracting principal components 23 


Extraction 19, 21 


Factor Analysis 16 
Fixed number of factors 22 


gun sight icon 6 
Histogram 29 


Kaiser’s criterion 17 




labelling individual points 4, 6 
loadings 17 


matrix scatterplot 7 
mean vector 12 
Means 12 
misclassification rate 32 


percentage variance explained 17 
plotting discriminant functions 28 
plotting principal components 24 
plotting symbols 5 

principal component analysis 16 
profile plot 9 

Properties 5, 10 


recoding 31 
re-ordering categories 10 


Save 29 

Save as variables 23 
Scatter/Dot 4,7 

Scores 23 

Scree Plot 21 

separation 27 

specifying an allocation rule 30 
specifying the number of components 
stacked histograms 29 
standardizing variables 13 


Transpose 9 
transposing data 9 


Value Labels 31 


